Building a WaveCache

ABSTRACT

A microarchitecture and instruction set that supports multiple, simultaneously executing threads. The approach is disclosed in regard to its applicability in connection with a recently developed microarchitecture called “WaveScalar.” In WaveScalar, a compiler breaks the control flow graph of a program into pieces called waves, whose instructions are partially ordered (i.e., a wave contains no back-edges) and into which control enters at a single point. Certain aspects of the present approach are also generally applicable to executing multiple threads on a more conventional microarchitecture. In one aspect of this approach, instructions are provided that enable and disable wave-ordered memory. Additional memory access instructions bypass wave-ordered memory, exposing additional parallelism. Also, a lightweight, interthread synchronization is employed that models hardware queue locks. Finally, a simple fence instruction is used to allow applications to handle relaxed memory consistency.

RELATED APPLICATIONS

This application is a divisional application based on copending application Ser. No. 11/284,760, filed on Nov. 22, 2005, which itself is a continuation-in-part based on copending patent application Ser. No. 11/041,396, filed on Jan. 21, 2005, which itself is also based on prior copending provisional applications Ser. No. 60/538,603, filed on Jan. 22, 2004, and Ser. No. 60/630,765, filed on Nov. 24, 2004, the benefit of the filing dates of which is hereby claimed under 35 U.S.C. §§ 120 and 119(e).

GOVERNMENT RIGHTS

This invention was funded at least in part with grants (No. CCR03-25635 and No. CCF01-33188) from the National Science Foundation, and the U.S. government may have certain rights in this invention.

BACKGROUND

It is widely accepted that Moore's Law growth in available transistors will continue for the next decade. Recent research, however, has demonstrated that simply scaling up current architectures will not convert these new transistors to commensurate increases in performance. This gap between the performance improvements that are needed and those that can be realized by simply constructing larger versions of existing architectures will fundamentally alter processor designs.

Three problems contribute to this gap, creating a processor scaling wall. The problems include the ever-increasing disparity between computation and communication performance—fast transistors, but slow wires; the increasing cost of circuit complexity, leading to longer design times, schedule slips, and more processor bugs; and the decreasing reliability of circuit technology, caused by shrinking feature sizes and continued scaling of the underlying material characteristics. In particular, modern superscalar processor designs will not scale, because they are built atop a vast infrastructure of slow broadcast networks, associative searches, complex control logic, and inherently centralized structures that must all be designed correctly for reliable execution. Like the memory wall, the processor scaling wall has motivated a number of research efforts. These efforts all augment the existing program counter-driven von Neumann model of computation by providing redundant checking mechanisms (see, for example, the work by T. M. Austin, “DIVA: A reliable substrate for deep submicron microarchitecture design,” International Symposium on Microarchitecture, 1999); exploiting compiler technology for limited dataflow-like execution, as disclosed by R. Nagarajan et al., “A design space evaluation of grid processor architectures,” International Symposium on Microarchitecture, 2001; or efficiently exploiting coarse-grained parallelism, as proposed by K. Mai et al., “Smart memories: A modular reconfigurable architecture,” International Symposium on Computer Architecture, 2002, or as disclosed by E. Waingold et al., “Baring it all to software: Raw machines,” IEEE Computer, vol. 30, no. 9, 1997.

A Case for Exploring Superscalar Alternatives

The von Neumann model of execution and its most sophisticated implementations, out-of-order superscalars, have been a phenomenal success. However, superscalars suffer from several drawbacks that are beginning to emerge. First, their inherent complexity makes efficient implementation a daunting challenge. Second, they ignore an important source of locality in instruction streams; and third, their execution model centers around instruction fetch, an intrinsic serialization point.

As features and cycle times shrink, the hardware structures that form the core of superscalar processors (register files, issue windows, and scheduling logic) become extremely expensive to access. Consequently, clock speed decreases and/or pipeline depth increases. Indeed, industry recognizes that building ever-larger superscalars as transistor budgets expand can be impractical, because of the processor scaling wall. Many manufacturers are turning to larger caches and chip multiprocessors to convert additional transistors into increased performance without impacting cycle time.

To squeeze maximum performance from each core, architects constantly add new algorithms and structures to designs. Each new mechanism, optimization, or predictor adds additional complexity and makes verification time an ever increasing cost in processor design. Verification already consumes about 40% of project resources on complex designs, and verification costs are increasing.

Untapped Locality

Superscalars devote a large share of their hardware and complexity to exploiting locality and predictability in program behavior. However, they fail to utilize a significant source of locality intrinsic to applications, i.e., dataflow locality. Dataflow locality is the predictability of instruction dependencies through the dynamic trace of an application. A processor could take advantage of this predictability to reduce the complexity of its communication system (i.e., register files and bypass networks) and reduce communication costs.

Dataflow locality exists because data communication patterns among static instructions are predictable. There are two independent, but complementary, types of dataflow locality—static and dynamic. Static dataflow locality exists because, in the absence of control, the producers and consumers of register values are precisely known. Within a basic block and between basic blocks that are not control dependent (e.g., the basic blocks before and after an If-Then-Else), the data communication patterns are completely static and, therefore, completely predictable. Dynamic dataflow locality arises from branch predictability. If a branch is highly predictable and almost always taken, for instance, then the static instructions before the branch frequently communicate with instructions on the taken path and rarely communicate with instructions on the not-taken path.

The vast majority of operand communication is highly predictable. Such high rates of predictability suggest that current processor communication systems are over-general, because they provide instructions with fast access to many more register values than needed. If the processor could exploit dataflow locality to ensure that necessary inputs were usually close at hand (at the expense of other potential inputs being farther away), it could reduce the average cost of communication.

Instead of simply ignoring dataflow locality, however, superscalars destroy it in their search for parallelism. Register renaming removes false dependencies, enables dynamic loop unrolling, and exposes a large amount of dynamic instruction level parallelism (ILP) for the superscalar core to exploit. However, it destroys dataflow locality. By changing the physical registers an instruction uses, renaming forces the architecture to provide each instruction with fast access to the entire physical register file, which results in a huge, slow register file and complicated forwarding networks.

Destroying dataflow locality leads to inefficiencies in modern processor designs: The processor fetches a stream of instructions with a highly predictable communication pattern, destroys that predictability by renaming, and then compensates by using broadcast communication in the register file and the bypass network, combined with complex scheduling in the instruction queue. The consequence is that modern processor designs devote few resources to actual execution (less than 10%, as measured on an Intel Corporation Pentium III™ die photo) and the vast majority to communication infrastructure.

Several industrial designs, such as partitioned superscalars like the Alpha 21264, some very long instruction word (VLIW) machines, and several research designs have addressed this problem with clustering or other techniques, and exploit dataflow locality to a limited degree. But none of these approaches makes full use of it, because they still include large forwarding networks and register files. Accordingly, it would be desirable to provide an execution model and architecture built expressly to exploit the temporal, spatial, and dataflow locality that exists in instruction and data streams.

The von Neumann Model: Serial Computing

The von Neumann model of computation is very simple. It has three key components: a program stored in memory, a global memory for data storage, and a program counter that guides execution through the stored program. At each step, the processor loads the instruction at the program counter, executes it (possibly updating main memory), and updates the program counter to point to the next instruction (possibly subject to branch instructions).

Two serialization points constrain the von Neumann model and, therefore, superscalar processors. The first arises as the processor, guided by the program counter and control instructions, assembles a linear sequence of operations for execution. The second serialization point is at the memory interface, where memory operations must complete (or appear to complete) in order to guarantee load-store ordering. The elegance and simplicity of the model are striking, but the price is steep. Instruction fetch introduces a control dependence between each instruction and the next and serves little purpose besides providing the ordering to which the memory interface must adhere. As a result, von Neumann processors are fundamentally sequential.

In practice, of course, von Neumann processors do achieve limited parallelism (i.e., instructions per cycle (IPC) greater than one) by using several methods. The explicitly parallel instruction sets for VLIW and vector machines enable the compiler to express instruction and data independence statically. Superscalars dynamically examine many instructions in the execution stream simultaneously, violating the sequential ordering when they determine it is safe to do so. In addition, recent work introduces limited amounts of parallelism into the fetch stage by providing multiple fetch and decode units.

It has been demonstrated that ample ILP exists within applications, but that the control dependencies that sequential fetch introduces constrain this ILP. Despite tremendous effort over decades of computer architecture research, no processor comes close to exploiting the maximum ILP present in applications, as measured in limit studies. Several factors account for this result, including the memory wall and necessarily finite execution resources, but control dependence and, by extension, the inherently sequential nature of von Neumann execution, remain dominant factors. Accordingly, a new approach is needed to overcome the limitations of the von Neumann model.

WaveScalar—A New Approach

An alternative to superscalar architecture that has been developed is referred to herein by the term “WaveScalar.” WaveScalar is a dataflow architecture. Unlike past dataflow work, which focused on maximizing processor utilization, WaveScalar seeks to minimize communication costs by avoiding long wires and broadcast networks. To this end, it includes a completely decentralized implementation of the “token-store” of traditional dataflow architectures and a distributed execution model. Commonly assigned U.S. patent application Ser. No. 11/041,396, which is entitled “WAVESCALAR ARCHITECTURE HAVING A WAVE ORDER MEMORY,” describes details of this dataflow architecture, and the drawings and specification of this application are hereby specifically incorporated herein by reference.

The key difference between WaveScalar and prior art dataflow architectures is that WaveScalar efficiently supports traditional von Neumann-style memory semantics in a dataflow model. Previously, dataflow architectures provided their own style of memory semantics and their own dataflow languages that disallowed side effects, mutable data structures, and many other useful programming constructs. Indeed, a memory ordering scheme that enables a dataflow machine to efficiently execute code written in general purpose, imperative languages (such as C, C++, Fortran, or Java) has eluded researchers for several decades. In contrast, the WaveScalar architecture provides a memory ordering scheme that efficiently executes programs written in any language.

Solving the memory ordering problem without resorting to a von Neumann-like execution model enables a completely decentralized dataflow processor to be built that eliminates all the large hardware structures that make superscalars nonscalable. Other recent attempts to build scalable processors, such as TRIPS (R. Nagarajan et al., “A design space evaluation of grid processor architectures,” International Symposium on Microarchitecture, 2001, and K. Sankaralingam et al., “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,” in International Symposium on Computer Architecture, 2003), Smart Memories (K. Mai et al., “Smart memories: A modular reconfigurable architecture,” in International Symposium on Computer Architecture, 2002), and Raw (W. Lee et al., “Space-time scheduling of instruction-level parallelism on a Raw machine,” International Conference on Architectural Support for Programming Languages and Operating Systems, 1998), have extended the von Neumann paradigm in novel ways, but they still rely on a program counter to sequence program execution and memory access, limiting the amount of parallelism they can reveal. WaveScalar completely abandons the program counter and linear von Neumann execution.

WaveScalar is currently implemented on a substrate comprising a plurality of processing nodes that effectively replaces the central processor and instruction cache of a conventional system. Conceptually, WaveScalar instructions execute in-place in the memory system and explicitly send their results to their dependents. In practice, WaveScalar instructions are cached in the processing elements—hence the name “WaveCache.”

The WaveCache loads instructions from memory and assigns them to processing elements for execution. They remain in the cache over many, potentially millions, of invocations. Remaining in the cache for long periods of time enables dynamic optimization of an instruction's physical placement in relation to its dependents. Optimizing instruction placement also enables a WaveCache to take advantage of predictability in the dynamic data dependencies of a program, which is referred to herein as “dataflow locality.” Just like conventional forms of locality (temporal and spatial), dataflow locality can be exploited by cache-like hardware structures.

Multithreading

Multithreading is an effective way to improve the performance of a computing system, and designers have long sought to introduce architectural support for threaded applications. Prior work includes hardware support for multiple thread contexts, mechanisms for efficient thread synchronization, and consistency models that provide threads with a unified view of memory at lower cost. Because of this large body of work and the large amount of silicon resources available, threaded architectures are now mainstream in commodity systems.

Interestingly, no single definition of a thread has proven suitable for all applications. For example, web servers and other task-based systems are suited to coarse-grain, pthread-style threads. Conversely, many media, graphics, matrix, and string algorithms contain significant fine-grain data parallelism. In addition, sophisticated compilers are capable of detecting parallelism on several levels—from instructions to loop bodies to function invocations.

Individual architectures, however, tend not to support this heterogeneity. Threaded architectures usually target a specific thread granularity or, in some cases, a small number of granularities, making it difficult or inefficient to execute and synchronize threads of a different grain. For example, extremely fine-grain applications cannot execute efficiently on a shared memory multiprocessor due to the high cost of synchronization. In contrast, dataflow machines provide excellent support for extremely fine-grain threads, but must be programmed in specialized languages to correctly execute traditional coarse-grain applications. This requirement stems from dataflow's inability to guarantee that memory operations will execute in a particular order.

In principle, if it could solve the ordering issue, a dataflow architecture like WaveScalar could support a wide range of thread granularities by decomposing coarse-grain threads into fine-grain threads. It would thus be particularly useful to employ such an approach in the WaveScalar architecture to achieve even greater efficiencies in processing than can be achieved using only ordered memory for processing coarse-grain threads.

Adding thread support to an architecture requires that designers solve several problems. First, they must determine what defines a thread in their architecture. Then, they must simultaneously isolate threads from one another and provide mechanisms, such as access to shared state and synchronization primitives, that allow them to communicate. Popular multithreaded systems such as SMPs, CMPs (see K. Olukotun et al., “The case for a single-chip multiprocessor,” in Architectural Support for Programming Languages and Operating Systems, 1996), and SMTs (see D. M. Tullsen et al., “Simultaneous Multithreading: Maximizing on-chip parallelism,” in International Symposium on Computer Architecture, 1995) define a thread in terms of its state, including a register set, a program counter, and an address space. In multiprocessors, thread separation is easy, because each thread has its own dedicated hardware and threads can only interact through memory. SMTs and other processors that support multiple thread contexts within a single pipeline (e.g., Tera (see R. Alverson et al., “The Tera computer system,” in International Conference on Supercomputing, pp. 1-6, 1990)) must exercise more care to ensure that threads do not interfere with one another. In these architectures, threads can communicate through memory, but other mechanisms are also possible.

The *T machine (see B. S. Ang et al., “StarT the next generation: integrating global caches and dataflow architecture,” Tech. Rep. CSG memo-354, MIT, 1994), the J-Machine (see M. Noakes et al., “The J-machine multicomputer: An architecture evaluation,” 1993), and the M-machine (see M. Fillo et al., “The M-machine multicomputer,” in International Symposium on Computer Architecture, 1995) define threads in similar terms but support two thread granularities. They use fine-grain threads to enable frequent communication (J-machine, *T) or hide latency (M-machine). Coarse-grain threads handle long-running, complex computations (J-machine, *T) or group fine-grain threads for scheduling (M-machine). Threads communicate via shared memory (J-machine, M-machine), message passing (J-machine), and direct accesses of another thread's registers (M-machine).

The Raw machine offers flexibility in thread definition, communication, and granularity by exposing the communication costs between tiles in a CMP-style grid architecture. A thread's state is at least the architectural state of a single tile, but could include several tiles and their network switches. Threads communicate through shared memory or by writing to the register files of adjacent tiles. For tightly synchronized threads, the compiler can statically schedule communication to achieve higher performance.

The TRIPS processor supports multiple threads by reallocating resources that it would otherwise dedicate to speculatively executing a single thread. In essence, it uses multiple threads to hide memory and branch latencies instead of speculating. The parameters that define a thread remain similar to other architectures.

The EM-4 hybrid dataflow machine (see M. Sato et al., “Thread-based programming for the EM-4 hybrid dataflow machine,” in International Symposium on Computer Architecture, 1992; and S. Sakai et al., “An architecture of a dataflow single chip processor,” in International Symposium on Computer Architecture, 1989) defines a thread using a set of registers and a memory frame. Synchronization is performed in a dataflow style, and the programmer is provided with library routines that make synchronization explicit.

The similarity in thread representation among these architectures reflects their underlying architectures—all are essentially small, register-based, PC-driven fetch-decode-execute-style processors. In contrast, WaveScalar is a dataflow architecture, though not the first to grapple with the role of threads. Most notably, the Monsoon architecture (see G. M. Papadopoulos et al., “Monsoon: An explicit token-store architecture,” in International Symposium on Computer Architecture, 1990; and G. M. Papadopoulos et al., “Multithreading: A revisionist view of dataflow architectures,” International Symposium on Computer Architecture, 1991), the P-RISC architecture (see R. S. Nikhil et al., “Can dataflow subsume von Neumann computing?,” in International Symposium on Computer Architecture, 1989, Computer Architecture News, 17(3), June 1989), and the Threaded Abstract Machine (TAM) architecture (see D. E. Culler et al., “Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine,” in Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, 1991) have developed, to different extents, a model of dataflow machines as systems of interacting, fine-grained imperative threads.

P-RISC adapts ideas from dataflow to von Neumann multiprocessors. To this end, it extends a RISC-like instruction set with fork and join instructions and the notion of two-phase memory operations. Programs consist of numerous small imperative threads with small execution contexts. Whenever a thread blocks on a long-latency operation, such as a remote load, another thread is removed from a ready queue (called the token queue) and executes. Synchronization between threads is handled with explicit memory instructions.

Programs for the Monsoon Explicit Token Store (ETS) architecture can be organized as collections of short, von Neumann-style threads that interact with each other and with memory using dataflow-style communication. The technique improves code scheduling by taking advantage of data locality. It also leads to an extension to the architecture in which the short, imperative threads employ a small set of high-speed temporary registers that are not part of the threads' stored context. Synchronization between threads is implicit, through the dataflow firing rule and presence bits in memory.

The TAM architecture adapts the Monsoon and P-RISC ideas to take advantage of hierarchical memory and scheduling systems. It does this by allowing the compiler more authority in scheduling code and data, adding a new level of scheduling hierarchy (called a quantum), and restricting communication between different groups of threads to well-defined communication interfaces. Synchronization between threads is explicit, as in P-RISC.

However, none of the prior art discussed above provides a workable solution for adapting an ordered memory dataflow architecture, such as WaveScalar, as necessary for enabling efficient processing using both fine- and coarse-grained threads. Accordingly, there is a need to provide a solution that includes the benefits of the dataflow architectures with such a multigrained thread processing capability.

SUMMARY

Prior WaveScalar work developed an ISA and microarchitecture to execute a single coarse-grain thread of execution, which, as noted above, is referred to herein as the WaveScalar architecture. Here, that design is expanded to support multiple threads. Support for multiple threads by this exemplary dataflow architecture was developed by providing the following software mechanisms:

-   Specific instructions that turn wave-ordered memory on and off. Since each thread in WaveScalar has a separate memory ordering, this approach is tantamount to creating and terminating coarse-grain threads.
-   A simple synchronization primitive that builds a hardware queue lock. This instruction provides memoryless, distributed, interthread synchronization by taking advantage of dataflow's inherent message passing, making it a good match for WaveScalar's distributed microarchitecture.
-   A new set of memory operations that enable applications to access memory without adhering to a global memory ordering. These instructions bypass wave-ordered memory, enabling independent memory operations to execute in parallel.
-   A dataflow version of a memory fence instruction that enables applications to use relaxed consistency models and which also serves as an intrathread synchronization mechanism for threads whose memory operations bypass wave-ordered memory.

Taken together, these mechanisms enable a dataflow architecture, such as the exemplary WaveScalar approach, to define and differentiate threads with a wide range of granularity. The new wave-ordered memory control mechanisms and the memoryless synchronization primitive, combined with an extended dataflow tag, provide coarse-grain, pthread-style threads. The memory fence mechanism ensures that the execution state of these threads becomes consistent with memory, even under a relaxed consistency model. The result is much greater efficiency and processing speed. For example, using this multithreaded support, Splash-2 benchmarks executing on the WaveScalar architecture were found to achieve speedups of 30-83 times, compared to single-threaded execution.

In addition, in an exemplary embodiment, WaveScalar uses the memory operations that bypass wave-ordered memory and both synchronization primitives to create extremely small threads. These “unordered threads” have very little overhead and may use very few hardware resources. Hence, they are extremely useful for expressing finer-grain loop and data parallelism and can be used to complete, for example, 7-13.5 multiply-accumulates (or similar units of work) per cycle, for three commonly used kernels.

It has been shown that, using this approach, conventional, coarse-grain threads and fine-grain, unordered threads can interact seamlessly in the same application. To demonstrate that integrating both styles is possible and profitable, they were applied to equake from the Spec2000 benchmark suite. The outermost loop of equake was parallelized with coarse-grain, pthread-style threads, and a key inner loop was implemented with fine-grain threads that use unordered memory. The results demonstrated that the multigranular threading approach achieves significantly better performance than either the coarse- or fine-grain approaches alone.

This Summary has been provided to introduce a few concepts in a simplified form that are further described in detail below in the Description. However, this Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

DRAWINGS

Various aspects and attendant advantages of one or more exemplary embodiments and modifications thereto will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

FIG. 1 illustrates an exemplary wave control flow graph, showing the memory operations in each basic block, their ordering annotations, and how the annotations enable the store buffer to reconstruct the correct order, where the darker arrows show the executed path;

FIG. 2 is a schematic diagram illustrating an exemplary hierarchical organization of the microarchitecture of the WaveCache;

FIG. 3 is a diagram illustrating three views of exemplary code that might be used in WaveScalar, where on the left is illustrated the C code for a simple computation, a corresponding WaveScalar dataflow graph is shown at the center, and the same graph is mapped onto a small patch of the WaveCache substrate at the right of the Figure;

FIG. 4 is a schematic diagram illustrating an example of thread creation and destruction, wherein a thread t spawns a new thread s by sending a THREAD-ID (s) and a WAVE-NUMBER (u) to a MEMORY-SEQUENCE-START command and setting up three input parameters for thread s with three DATA-TO-THREAD-WAVE instructions;

FIG. 5 is a graph showing an exemplary thread creation overhead, with contour lines for speedups of 1× (i.e., no speedup), 2×, and 4×;

FIGS. 6A and 6B illustrate exemplary contrasting cases for matching inputs, indicating how most instructions, like the ADD instruction shown in FIG. 6A, fire when the thread and wave numbers on both input tokens match, while in contrast, inputs to a THREAD-COORDINATE instruction shown in FIG. 6B match if the THREAD-ID of the token on the second input matches the data value of the token on the first input;

FIG. 7 illustrates an example wherein a THREAD-COORDINATE instruction is used to construct a mutual exclusion (mutex) object;

FIG. 8 is a graph showing benchmarks for Splash-2 on the exemplary WaveCache for between 1 and 128 threads, wherein the bars represent speedup in total execution time, and the numbers above the single-threaded bars are IPC for that configuration (note that two benchmarks, water and radix, cannot utilize 128 threads with the input data set used);

FIG. 9 is a graph comparing the performance of various architectures, wherein each bar represents the performance of a given architecture for a varied number of threads;

FIGS. 10 and 11 are graphs comparing the performance of three implementation styles, measured in multiply-accumulates for a matrix multiply (MMUL) and a finite impulse response (FIR) filter, and in character comparisons for a longest common subsequence (LCS), wherein the graph in FIG. 10 shows execution-time speedup relative to a serial coarse-grain implementation, and the graph in FIG. 11 compares the work per cycle achieved by each implementation style;

FIG. 12 is a schematic illustration showing transitioning between ordered and unordered memory interfaces;

FIG. 13 illustrates an example using ordered and unordered memory together, where MEMORY-NOP-ACK is used to combine ordered and unordered memory operations to express memory parallelism;

FIG. 14 is a schematic block diagram illustrating an exemplary flow of operands through the PE pipeline and forwarding networks;

FIG. 15 is a schematic diagram illustrating details of an exemplary matching table that uses a large number of small, single-ported SRAMs to allow up to four operands to be written each cycle, wherein a tracker board detects when instructions are ready to fire;

FIG. 16 is a schematic diagram illustrating details of the dispatch stage and fire control unit (FCU), which are in charge of scheduling instructions for execution, wherein the DISPATCH stage schedules execution on the arithmetic logic unit (ALU) and is in charge of allowing dependent instructions to execute on consecutive cycles;

FIG. 17 is a schematic diagram illustrating an exemplary execution (EXECUTE) stage of a PE that includes a single general purpose ALU, which accepts three input operands and implements the WaveScalar instruction set;

FIG. 18 is a schematic diagram illustrating the OUTPUT interface, which distributes the PE's output to consumers, wherein the output resides in the output queue until sent;

FIG. 19 is a schematic diagram illustrating a high-level view of the interconnects within a cluster;

FIG. 20 is a schematic diagram showing an exemplary view of the intra-domain interconnect (for space reasons, the interconnect for a 2-PE (1 pod) domain is shown, although the exemplary design includes 8 PEs); the thick horizontal lines are the broadcast busses for each PE and the network and memory interfaces, while the fine lines convey the ACK and NAK signals;

FIG. 21 illustrates exemplary ACK/NAK timing diagrams for a simple transaction between PE0 and PE1/PE2;

FIG. 22 is a schematic diagram illustrating the southern port of an exemplary inter-cluster switch, wherein incoming messages are routed toward their destination (West, North, East, PEs, or Store buffer/L1 Cache), and depending on the type of an outgoing message, the switch uses one of two virtual channels to route the message;

FIG. 23 is a graph showing the distribution of interconnect traffic in the WaveCache for a plurality of different applications, wherein a majority of traffic in the WaveCache is confined within a single cluster and, for many applications, over half travels only over the intra-domain interconnect;

FIG. 24 is a schematic diagram illustrating the store buffer logic and structures needed to order a single wave of memory requests;

FIG. 25 is a graph illustrating the performance (i.e., speedup) of the WaveScalar architecture, for different microarchitectural implementations, when running a variety of applications;

FIG. 26 is a flowchart illustrating exemplary logical steps for processing memory operations;

FIG. 27 is a flowchart illustrating exemplary logical steps for combining ordered and unordered memory operations using a memory sequence start instruction;

FIG. 28 is a flowchart illustrating exemplary logical steps for combining ordered and unordered memory operations using a memory fence instruction; and

FIG. 29 is a flowchart illustrating exemplary logical steps for implementing partial store control.

DESCRIPTION

Figures and Disclosed Embodiments Are Not Limiting

Exemplary embodiments are illustrated in referenced Figures of the drawings. It is intended that the embodiments and Figures disclosed herein are to be considered illustrative rather than restrictive.

WaveScalar Overview

This overview discusses only those portions of the WaveScalar architecture that provide a context for the multigranular threading approach. A more in-depth description of WaveScalar is provided in the previously filed U.S. Patent Application, noted above, which has been incorporated herein by reference.

The WaveScalar Instruction Set

In most respects, the WaveScalar instruction set provides the same computing capabilities as a RISC instruction set. Differences occur primarily because it is a dataflow architecture, and with a few notable exceptions, it follows the examples of previous dataflow machines.

WaveScalar binaries: A WaveScalar binary is a program's dataflow graph. Each node in the graph is a single instruction that computes a value and sends it to the instructions that consume it. Instructions execute after all input operand values have arrived, according to a principle known as the dataflow firing rule.

Waves and wave numbers: When compiling a program for WaveScalar, a compiler breaks its control flow graph into pieces called waves. The key properties of a wave are: (1) its instructions are partially ordered (i.e., it contains no back-edges), and (2) control enters at a single point. Unlike a similar construct, hyperblocks, waves may contain true control-flow joins without predication. Doing so facilitates the easy creation of large waves by unrolling loops.

Multiple waves composed of the same static code (for example, iterations of a loop) may execute simultaneously. To distinguish these instances, known as dynamic waves, each value in the WaveScalar ISA carries a tag, called a WAVE-NUMBER. Together, a value and its WAVE-NUMBER form a token. The WaveScalar ISA includes special instructions that manipulate WAVE-NUMBERs. Memory-ordering hardware, described below, constrains the number of simultaneously executing waves, and schedules their memory operations in program order.
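
By way of illustration only, the following C sketch models a token as a data value paired with its WAVE-NUMBER tag, so that dynamic instances of the same static instruction can be told apart; the type and field names are hypothetical and are not part of the WaveScalar ISA:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical model of a WaveScalar token: a value plus its tag.
     * At this point the tag holds only a WAVE-NUMBER; a THREAD-ID is
     * added to the tag later in this description. */
    typedef struct {
        uint32_t wave_number;   /* distinguishes dynamic waves, e.g., loop iterations */
        int64_t  value;         /* the data value carried by the token */
    } token;

    int main(void)
    {
        /* Two outputs of the same static instruction, one per iteration. */
        token iter0 = { .wave_number = 0, .value = 42 };
        token iter1 = { .wave_number = 1, .value = 43 };
        printf("<w%u>.%lld  <w%u>.%lld\n",
               (unsigned)iter0.wave_number, (long long)iter0.value,
               (unsigned)iter1.wave_number, (long long)iter1.value);
        return 0;
    }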

Memory ordering: Most programming languages provide the programmer with a model of memory that totally orders memory operations. Lacking an efficient mechanism to support this total load-store ordering, most previous dataflow architectures could not effectively execute programs written in imperative languages (e.g., C, C++, or Java). WaveScalar overcomes this limitation with a technique called wave-ordered memory. In wave-ordered memory, the compiler uses the control flow graph and the instruction order within basic blocks to annotate each memory operation with: (1) its position in its wave, called a sequence number; and (2) its execution order relative to other memory operations in the same wave. As the memory operations execute, these annotations travel to the memory system, allowing it to apply memory operations in the correct order.

To annotate each memory instruction in a wave, the WaveScalar compiler traverses the wave's control flow graph in breadth-first order. Within the basic block at each CFG node, it assigns consecutive sequence numbers to consecutive memory operations. Next, the compiler labels each memory operation with the sequence numbers of its predecessor and successor memory operations, if they can be uniquely determined (see the left side of FIG. 1). Since branch instructions create multiple predecessors or successors, a special wild-card value, ‘?’, is used in these cases.
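
The following C sketch is illustrative only; it annotates the memory operations of a single straight-line basic block with <predecessor, sequence number, successor> triples, using -1 to stand in for the ‘?’ wild card. The function and structure names are hypothetical, and the simplifying assumption that a unique predecessor or successor is the adjacent sequence number is made here for brevity rather than taken from the disclosed compiler:

    #include <stdio.h>

    #define WILDCARD (-1)   /* stands in for the '?' annotation */

    /* Hypothetical annotation record for one memory operation:
     * <predecessor, sequence number, successor>, as in FIG. 1. */
    typedef struct {
        int pred;
        int seq;
        int succ;
    } mem_annotation;

    /* Annotate n consecutive memory operations that form one basic block,
     * numbering them from first_seq.  Whether the block's entry and exit
     * links are unique depends on the CFG and is passed in by the caller;
     * for simplicity a unique predecessor is assumed to be first_seq - 1
     * and a unique successor first_seq + n. */
    static void annotate_block(mem_annotation *ops, int n, int first_seq,
                               int unique_pred, int unique_succ)
    {
        for (int i = 0; i < n; ++i) {
            ops[i].seq  = first_seq + i;
            ops[i].pred = (i > 0) ? ops[i].seq - 1
                                  : (unique_pred ? first_seq - 1 : WILDCARD);
            ops[i].succ = (i < n - 1) ? ops[i].seq + 1
                                      : (unique_succ ? first_seq + n : WILDCARD);
        }
    }

    int main(void)
    {
        /* Three memory operations in a block that ends at a branch, so the
         * successor of the last operation cannot be uniquely determined. */
        mem_annotation block[3];
        annotate_block(block, 3, 4, 1, 0);
        for (int i = 0; i < 3; ++i)
            printf("<%d, %d, %d>\n", block[i].pred, block[i].seq, block[i].succ);
        return 0;
    }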

During program execution, the memory system (in our implementation, a store buffer) uses these annotations to assemble a wave's loads and stores in the correct order. The left side of FIG. 1 illustrates an example 10 of memory operations and their wave-ordered annotations 12, 14, 16, 18, 20, and 22, while the right side shows how the annotations enable a store buffer 24 to order memory operations and detect those that are missing. Assume the load with sequence number 7 (grayed out) is the last instruction to arrive at the store buffer. Before its arrival, the store buffer knows that at least one memory operation between memory operations 14 and 20, i.e., numbers 4 and 8, is missing, because 4's successor and 8's predecessor are both ‘?’. As a result, memory operation 20, i.e., number 8, cannot be executed. The arrival of load memory instruction 22, i.e., number 7, provides links 26 and 28 between memory operations 4 and 8, enabling the store buffer to execute both 7 and 8.
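
A minimal, hypothetical C sketch of this ordering check follows; it models only the annotations (not addresses or data) and reproduces the FIG. 1 scenario in which operation 8 cannot issue until operation 7 arrives. The data structures and function names are assumptions made for illustration, not the disclosed store buffer design:

    #include <stdio.h>

    #define WILDCARD (-1)
    #define MAX_OPS  16

    /* Hypothetical store-buffer entry holding only the wave-ordering
     * annotation of a memory request that has arrived. */
    typedef struct {
        int pred, seq, succ;
        int arrived;
    } sb_entry;

    static sb_entry table[MAX_OPS];     /* indexed by sequence number */

    static void arrive(int pred, int seq, int succ)
    {
        table[seq] = (sb_entry){ pred, seq, succ, 1 };
    }

    /* Issue every request known to be next in program order.  'last' is the
     * sequence number of the most recently issued request; the walk stalls
     * when it cannot link the next operation to the issued prefix. */
    static int walk(int last)
    {
        for (;;) {
            int next = WILDCARD;
            if (table[last].succ != WILDCARD && table[table[last].succ].arrived) {
                next = table[last].succ;             /* linked by my successor field */
            } else {
                for (int s = 0; s < MAX_OPS; ++s) {  /* linked by their predecessor field */
                    if (table[s].arrived && table[s].pred == last) { next = s; break; }
                }
            }
            if (next == WILDCARD)
                return last;                         /* gap detected: stall */
            printf("issue memory operation %d\n", next);
            last = next;
        }
    }

    int main(void)
    {
        /* The executed path of FIG. 1: operations 4, 7, and 8.  Operation 8
         * cannot issue until 7 arrives, because 4's successor and 8's
         * predecessor are both '?'. */
        arrive(3, 4, WILDCARD);
        arrive(WILDCARD, 8, 9);
        int last = 4;                 /* assume operation 4 has already issued */
        last = walk(last);            /* stalls: nothing links 4 to 8          */
        arrive(4, 7, 8);              /* the missing load arrives              */
        last = walk(last);            /* issues 7, then 8                      */
        return 0;
    }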

The WaveCache: A WaveScalar Processor

The WaveCache is a microarchitecture that executes WaveScalar binaries. An exemplary design is the baseline model used in the simulations discussed below.

Execution: Conceptually, WaveScalar assumes that each static instruction in a program binary executes in a separate processing element (PE). Each PE manages operand tag matching for its instruction. When two operands with identical tags arrive at the PE, the instruction executes (this is the dataflow firing rule) and explicitly communicates the result to statically encoded consumer instructions.
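
For illustration only, the following C sketch models the dataflow firing rule at one PE for a two-input ADD: an operand waits in a small matching table until its partner with the same tag arrives, at which point the instruction fires. The table organization and names here are hypothetical simplifications, not the matching table of the disclosed microarchitecture:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical operand: a wave-number tag, a value, and a valid bit. */
    typedef struct { uint32_t wave; int64_t value; int valid; } operand;

    static operand left_slot[4], right_slot[4];   /* a tiny matching table */

    static void fire_add(operand a, operand b)
    {
        printf("fire ADD: <w%u>.%lld\n", (unsigned)a.wave,
               (long long)(a.value + b.value));
    }

    /* Deliver one operand to the PE; is_left selects the input port. */
    static void deliver(operand in, int is_left)
    {
        unsigned slot = in.wave % 4;
        operand *other = is_left ? &right_slot[slot] : &left_slot[slot];
        if (other->valid && other->wave == in.wave) {
            fire_add(in, *other);               /* tags match: the instruction fires */
            other->valid = 0;
        } else {
            operand *mine = is_left ? &left_slot[slot] : &right_slot[slot];
            *mine = in;                         /* no partner yet: wait in the table */
            mine->valid = 1;
        }
    }

    int main(void)
    {
        deliver((operand){ .wave = 5, .value = 2, .valid = 1 }, 1);
        deliver((operand){ .wave = 6, .value = 9, .valid = 1 }, 0);  /* different wave: no fire */
        deliver((operand){ .wave = 5, .value = 3, .valid = 1 }, 0);  /* matching wave: ADD fires */
        return 0;
    }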

Clearly, building a PE for each static instruction in an application is both impossible and wasteful, so in practice, instructions are dynamically bound to a fixed-size grid of PEs and swapped in and out on demand. The PEs cache the working set of the application—hence the name WaveCache. FIG. 3 shows a simple code fragment 30 mapped onto part of a WaveCache 32.

Processor organization: The WaveCache is a grid of simple, five-stage pipelined processing elements. A register transfer level (RTL) model of the design achieves a clock cycle of 25 fan-out-of-four (FO4) delays. Each PE contains a functional unit, specialized memories to hold operands, and logic to control instruction execution and communication. A functional unit also contains buffering and storage for several different static instructions, although only one can fire each cycle. Each PE handles tag matching for its own instructions, contributing to the scalability of the WaveCache design.

To reduce communication costs within the grid, PEs 64 are organized hierarchically, as depicted in a block diagram 50 in FIG. 2. Two PEs 64 are first coupled, forming a pod 65 that shares operand scheduling and output-to-source bypass logic. Within a pod 65, instructions execute on one PE and their results are sent to the partner PE (of the pod) in a single cycle. Four PE pods 65 comprise one of domains 56 a, 56 b, 56 c, and 56 d, within which producer-consumer latency is five cycles. The four domains are then grouped into a cluster 52, which also contains wave-ordered memory hardware and a traditional L1 data cache 62 that is coupled to store buffers 58. A single cluster, combined with an L2 cache 54 and traditional main memory, is sufficient to run any WaveScalar program. To build larger machines, multiple clusters are connected by an on-chip network 60, and cache coherence is maintained by a traditional, directory-based protocol, with multiple readers and a single writer. The coherence directory and the L2 cache are distributed around the edge of the grid of clusters. Table 1 describes the WaveCache configuration discussed herein. Simulations that were executed accurately model contention on all network links and communication busses for operand, memory, and cache coherence traffic. Instruction placement is done on-demand and dynamically snakes instructions across the grid.

TABLE 1
Microarchitectural parameters of the exemplary WaveCache
WaveCache Capacity: 131,072 static instructions (64 per PE)
PEs per Domain: 8 (4 pods)
Domains per Cluster: 4
PE Input Queue: 16 entries, 4 banks
PE Output Queue: 8 entries, 4 ports (2r, 2w)
PE Pipeline Depth: 5 stages
Network Latency: within Pod: 1 cycle; within Domain: 5 cycles; within Cluster: 9 cycles; inter-Cluster: 9 + cluster distance
L1 Caches: 32 KB, 4-way set associative, 128 B line, 4 accesses/cycle
L2 Cache: 16 MB shared, 1024 B line, 4-way set associative, 20 cycle access
Main RAM: 1000 cycle latency
Network Switch: 4-port, bidirectional

Wave-ordered memory hardware: The wave-ordered memory hardware is distributed throughout the WaveCache as part of the store buffers. Each cluster contains four store buffers, all accessed through a single port. A dynamic wave is bound to one store buffer, which fields all memory requests for that wave. The store buffer itself is a small memory that holds memory requests. A simple state machine implements the wave-ordered memory logic by “walking” the sequence of requests and stalling when it detects a missing operation. This ensures that memory operations are issued to the L1 data caches in the correct order.

After a wave executes, its store buffer signals the store buffer for the next wave to proceed—analogous to a baton pass in a relay race. This scheme allows all store buffers to remain logically centralized, despite their physically distributed implementation.

The remaining issue lies in assigning store buffers to waves. To accomplish this, a table kept in main memory is used that maps wave numbers to store buffers. Memory instructions send their requests to the nearest store buffer, which accesses the map to determine where the message should go. If the map already has an entry for the current wave, it forwards the message to the appropriate store buffer. If there is no entry, the store buffer atomically updates it with its own location and processes the request.
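
As a simplified illustration of this mapping step (the map layout, names, and claim protocol below are hypothetical; in hardware the update must be atomic), the following C sketch shows how the nearest store buffer either claims an unowned wave or forwards the request to the wave's existing owner:

    #include <stdio.h>

    #define NUM_WAVES 64
    #define NO_OWNER  (-1)

    /* Hypothetical map from wave number to the store buffer that owns it. */
    static int owner_of_wave[NUM_WAVES];

    /* Called at the store buffer nearest to the requesting instruction.
     * Returns the id of the store buffer that will service the request. */
    static int route_request(int my_buffer_id, int wave)
    {
        if (owner_of_wave[wave] == NO_OWNER) {
            /* No owner yet: claim the wave.  A plain assignment stands in
             * here for the atomic read-modify-write done in hardware. */
            owner_of_wave[wave] = my_buffer_id;
            return my_buffer_id;             /* process the request locally */
        }
        return owner_of_wave[wave];          /* forward to the existing owner */
    }

    int main(void)
    {
        for (int w = 0; w < NUM_WAVES; ++w) owner_of_wave[w] = NO_OWNER;

        printf("wave 3 handled by buffer %d\n", route_request(1, 3)); /* claims: 1  */
        printf("wave 3 handled by buffer %d\n", route_request(2, 3)); /* forwards: 1 */
        printf("wave 4 handled by buffer %d\n", route_request(2, 4)); /* claims: 2  */
        return 0;
    }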

Coarse-Grain Threads in WaveScalar

As originally developed, the WaveScalar instruction set and WaveCache microarchitecture were capable of executing a single coarse-grain thread of execution. However, in the further development of the architecture that is described herein, support has been added to WaveScalar to simultaneously execute multiple coarse-grain, pthread-style threads. Three additions to the instruction set architecture (ISA) and microarchitecture enable this capability. First, the wave-ordered memory interface was extended to simultaneously support multiple active, independent threads of execution. Second, a lightweight, intrathread synchronization mechanism was introduced that enables WaveScalar to provide an efficient relaxed consistency model of memory. Finally, a low overhead, memoryless synchronization mechanism was introduced that models a hardware queue lock and provides efficient interthread communication.

Multiple Memory Orderings

As previously introduced, the wave-ordered memory interface provides support for a single memory ordering. Forcing all threads to contend for the same memory interface, even if it were possible, would be detrimental to performance. Consequently, to support multiple threads, the exemplary WaveScalar architecture was extended to allow multiple independent sequences of ordered memory accesses, each of which belongs to a separate thread. First, every data value in a WaveScalar machine was annotated with a THREAD-ID in addition to its WAVE-NUMBER. Then, instructions were introduced to associate memory ordering resources with particular THREAD-IDs. Finally, the necessary changes were made to the WaveCache architecture, and the efficiency of the architecture as thus modified was evaluated.

THREAD-IDs: The WaveCache already has a mechanism for distinguishing values and memory requests within a single thread from one another—the values and memory requests are tagged with WAVE-NUMBERs. To differentiate values from different threads, this tag was extended with a THREAD-ID, and WaveScalar's dataflow firing rule was modified to require that operand tags match on both THREAD-ID and WAVE-NUMBER. As with WAVE-NUMBERs, additional instructions were provided to directly manipulate THREAD-IDs. In figures and examples included herein, the notation <t, w>.d signifies a token tagged with THREAD-ID t and WAVE-NUMBER w, and having a data value d.

To manipulate THREAD-IDs and WAVE-NUMBERs, several instructions were introduced that convert WAVE-NUMBERs and THREAD-IDs to normal data values and back again. The most powerful of these is DATA-TO-THREAD-WAVE, which sets both the THREAD-ID and WAVE-NUMBER at once; DATA-TO-THREAD-WAVE takes three inputs, <t₀, w₀>.t₁, <t₀, w₀>.w₁, and <t₀, w₀>.d, and produces as output <t₁, w₁>.d. WaveScalar also provides two instructions (DATA-TO-THREAD and DATA-TO-WAVE) to set THREAD-IDs and WAVE-NUMBERs separately, as well as two instructions (THREAD-TO-DATA and WAVE-TO-DATA) to extract THREAD-IDs and WAVE-NUMBERs.
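
The following C sketch, provided for illustration only, models two of these conversions on tokens carrying the extended <THREAD-ID, WAVE-NUMBER> tag; the struct and function names are hypothetical stand-ins for the instructions described above:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical token with the extended tag: <THREAD-ID, WAVE-NUMBER>.value */
    typedef struct { uint32_t thread, wave; int64_t value; } token;

    /* DATA-TO-THREAD-WAVE: retag a data token with a new THREAD-ID and
     * WAVE-NUMBER, both of which arrive as data values on like-tagged tokens. */
    static token data_to_thread_wave(token new_thread, token new_wave, token data)
    {
        return (token){ .thread = (uint32_t)new_thread.value,
                        .wave   = (uint32_t)new_wave.value,
                        .value  = data.value };
    }

    /* THREAD-TO-DATA: expose a token's THREAD-ID as its data value. */
    static token thread_to_data(token in)
    {
        return (token){ .thread = in.thread, .wave = in.wave, .value = in.thread };
    }

    int main(void)
    {
        /* <t0,w0>.t1, <t0,w0>.w1, <t0,w0>.d  ->  <t1,w1>.d */
        token t1 = { 0, 0, 7 }, w1 = { 0, 0, 12 }, d = { 0, 0, 99 };
        token out = data_to_thread_wave(t1, w1, d);
        printf("<t%u, w%u>.%lld\n", (unsigned)out.thread, (unsigned)out.wave,
               (long long)out.value);

        token id = thread_to_data(out);
        printf("<t%u, w%u>.%lld\n", (unsigned)id.thread, (unsigned)id.wave,
               (long long)id.value);
        return 0;
    }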

Managing memory orderings: Having associated a THREAD-ID with each value and memory request, the wave-ordered memory interface was extended to enable programs to associate memory orderings with THREAD-IDs. Two new instructions control the creation and destruction of memory orderings, in essence creating and terminating coarse-grain threads. These two instructions are: MEMORY-SEQUENCE-START and MEMORY-SEQUENCE-STOP.

MEMORY-SEQUENCE-START creates a new wave-ordered memory sequence, often a new thread. This thread is assigned to a store buffer, which services all memory requests tagged with its THREAD-ID and WAVE-NUMBER; requests with the same THREAD-ID but a different WAVE-NUMBER cause a new store buffer to be allocated, as described above.

MEMORY-SEQUENCE-STOP terminates a memory ordering sequence. The wave-ordered memory system uses this instruction to ensure that all memory operations in the sequence have completed before the store buffer resources are released. FIG. 4 shows instructions 70 that illustrate how, using these instructions, a thread t creates a new thread s, and thread s executes and then terminates. In this example, THREAD-ID (s) and WAVE-NUMBER (u) are supplied to MEMORY-SEQUENCE-START 72, and three input parameters 74 are set up for thread s with three DATA-TO-THREAD-WAVE instructions 76. The inputs to each DATA-TO-THREAD-WAVE instruction are a parameter value (d, e, or j), the new THREAD-ID (s), and the new WAVE-NUMBER (u). A token 78 with u is deliberately produced by the instruction MEMORY-SEQUENCE-START, to guarantee that no instructions in thread s will execute until MEMORY-SEQUENCE-START has finished allocating store buffer area for s. Thread s terminates with instruction MEMORY-SEQUENCE-STOP 80, whose output token <s, u>.finished indicates that its store buffer area has been deallocated.
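
Purely as an illustrative sketch of the spawn sequence in FIG. 4 (the functions below are hypothetical software stand-ins for the instructions, and the data dependence on the START output token is modeled by an explicit parameter), the following C program shows thread t allocating a memory ordering for thread s, retagging s's parameters, and then tearing the ordering down:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t thread, wave; int64_t value; } token;

    /* Hypothetical stand-in for MEMORY-SEQUENCE-START: allocate store buffer
     * resources for <s, u> and emit a token carrying u.  Everything in thread
     * s that touches memory is made data-dependent on this token, so s cannot
     * begin before the allocation finishes. */
    static token memory_sequence_start(uint32_t s, uint32_t u)
    {
        printf("allocate store buffer area for <t%u, w%u>\n", (unsigned)s, (unsigned)u);
        return (token){ .thread = s, .wave = u, .value = u };
    }

    /* Hypothetical stand-in for MEMORY-SEQUENCE-STOP. */
    static void memory_sequence_stop(uint32_t s, uint32_t u)
    {
        printf("deallocate store buffer; emit <t%u, w%u>.finished\n",
               (unsigned)s, (unsigned)u);
    }

    /* Hypothetical stand-in for DATA-TO-THREAD-WAVE; 'ready' models the
     * dependence on the token produced by MEMORY-SEQUENCE-START. */
    static token data_to_thread_wave(token ready, uint32_t s, uint32_t u, int64_t d)
    {
        (void)ready;
        return (token){ .thread = s, .wave = u, .value = d };
    }

    int main(void)
    {
        uint32_t s = 2, u = 9;                      /* new THREAD-ID and WAVE-NUMBER */
        token ready = memory_sequence_start(s, u);  /* thread t spawns thread s      */
        token p0 = data_to_thread_wave(ready, s, u, 10);   /* three parameters       */
        token p1 = data_to_thread_wave(ready, s, u, 20);
        token p2 = data_to_thread_wave(ready, s, u, 30);
        printf("thread s starts with %lld %lld %lld\n",
               (long long)p0.value, (long long)p1.value, (long long)p2.value);
        memory_sequence_stop(s, u);                 /* thread s terminates           */
        return 0;
    }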

Implementation: Adding support for multiple memory orderings requires only small changes to the WaveCache's microarchitecture. First, the widths of the communication busses and operand queues must be expanded to hold THREAD-IDs. Second, instead of storing every static instruction from the working set of a program in the WaveCache, one copy of each static instruction is stored for each thread, which means that if two threads are executing the same static instructions, each may map the static instruction to different PEs.

Efficiency: The overhead associated with spawning a thread directly affects the granularity of extractable parallelism. To assess this overhead in the WaveCache, a controlled experiment consisting of a simple parallel loop was designed, in which each iteration executes in a newly spawned thread. The size of the loop body was varied, which affects the granularity of parallelism, as was the dependence distance between memory operands, which affects the number of threads that can execute simultaneously. Speedup compared to a serial execution of a loop doing the same work was then measured. The experiment's goal was to answer the following question. Given a loop body with a critical path length of N instructions and a dependence distance of T iterations (i.e., the ability to execute T iterations in parallel), can execution be speeded up by spawning a new thread for every loop iteration?

FIG. 5 is a contour plot of speedup of the loop as a function of its loop size (critical path length in ADD instructions, the horizontal axis) and dependence distance (independent iterations, the vertical axis). Contour lines 90, 92, and 94 are shown respectively for speedups of 1× (no speedup), 2×, and 4×. The area above each contour line is a region of program speedup that is at or above the labeled value. The data show that the overhead of creating and destroying threads via MEMORY-SEQUENCE-START and MEMORY-SEQUENCE-STOP is so low that for loop bodies of only 24 dependent instructions and a dependence distance of 3, it becomes advantageous to spawn a thread to execute each iteration. A dependence distance of 10 reduces the size of profitably parallelizable loops to only four instructions. Increasing the loop size to 20 instructions quadruples performance. (If independent iterations need to make potentially recursive function calls, extra overhead may apply.)

Synchronization

The ability to efficiently create and terminate pthread-style threads, as described in the previous section, provides only part of the functionality required to make multithreading useful. Independent threads must also synchronize and communicate with one another. WaveScalar recognizes two types of synchronization: intrathread and interthread. Intrathread synchronization can be used to build a relaxed consistency model by synchronizing the execution of a thread with its outstanding memory operations. The second primitive models a hardware queue lock and provides interthread synchronization. In the following sections, the mechanisms that support these two forms of synchronization are discussed, followed by an exemplary mutex. (A mutex is a program object that enables multiple program threads to share the same resource, such as file access, but not simultaneously.)

Memory Fence

Wave-ordered memory provides a single thread with a consistent view of memory, since it guarantees that the results of earlier memory operations are visible to later operations. In some situations, such as before taking or releasing a lock, a multithreaded processor must guarantee that the results of a thread's memory operations are visible to other threads. An additional instruction, MEMORY-NOP-ACK, was added to the ISA to provide this assurance by acting as a memory fence. MEMORY-NOP-ACK prompts the wave-ordered interface to commit the thread's prior loads and stores to memory, thereby ensuring their visibility to other threads and providing WaveScalar with a relaxed consistency model. The interface then returns an acknowledgment, which the thread can use to trigger execution of its subsequent instructions.
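
The fence behavior can be sketched, for illustration only, as a drain-then-acknowledge step: prior buffered operations are forced to memory and a token is produced on which later instructions depend. The buffer, function names, and single-element model below are hypothetical and greatly simplified relative to the wave-ordered interface:

    #include <stdio.h>

    #define MAX_PENDING 8
    static int pending[MAX_PENDING];   /* addresses of buffered ordered stores */
    static int npending;

    static void ordered_store(int addr)
    {
        pending[npending++] = addr;    /* buffered; not yet visible to other threads */
    }

    /* Hypothetical stand-in for MEMORY-NOP-ACK: commit all prior ordered
     * operations, then return an acknowledgment token. */
    static int memory_nop_ack(void)
    {
        for (int i = 0; i < npending; ++i)
            printf("commit store to address %d\n", pending[i]);
        npending = 0;
        return 1;                      /* the acknowledgment token */
    }

    int main(void)
    {
        ordered_store(100);            /* stores performed inside a critical section */
        ordered_store(104);
        int ack = memory_nop_ack();    /* fence: all prior stores now visible */
        if (ack)
            printf("safe to release the lock\n");
        return 0;
    }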

Multiprocessors provide a variety of relaxed consistency models. Some, including release consistency (S. V. Adve and K. Gharachorloo, “Shared memory consistency models: A tutorial,” IEEE Computer (29,12), 1996) and the model used by the Alpha (see R. L. Sites et al., “Alpha AXP Architecture Reference Manual,” Digital Press, second ed., 1995), ensure a consistent view only in the presence of memory barrier instructions. MEMORY-NOP-ACK provides this functionality by forcing a thread's memory operations to memory.

Interthread Synchronization

Most commercially deployed multiprocessors and multithreaded processors provide interthread synchronization through the memory system via primitives such as TEST-AND-SET, COMPARE-AND-SWAP, or LOAD-LOCK/STORE-CONDITIONAL. Some research efforts also propose building complete locking mechanisms in hardware. Such queue locks (for example, A. Kagi et al., “Efficient Synchronization: Let Them Eat QOLB,” International Symposium on Computer Architecture, 1997, and D. M. Tullsen et al., “Supporting Fine-Grain Synchronization on a Simultaneous Multithreaded Processor,” International Symposium on High Performance Computer Architecture, 1999) offer many performance advantages in the presence of high lock contention.

In WaveScalar, support was added for queue locks in a way that constrains neither the number of locks nor the number of threads that may contend for a lock. This support is embodied in a synchronization instruction called THREAD-COORDINATE, which synchronizes two threads by passing a value between them. THREAD-COORDINATE is similar in spirit to other lightweight synchronization primitives, but is tailored to WaveScalar's dataflow framework. Rather than requiring an additional hardware memory and finite state machine to implement it, THREAD-COORDINATE exploits the tag matching logic that every PE already uses to carry out dataflow execution.

FIGS. 6A and 6B respectively illustrate matching rules 96 and 98 required to support THREAD-COORDINATE and how they differ from the matching rules for normal instructions. All WaveScalar instructions except THREAD-COORDINATE fire when the tags of two input values 100 and 102 match, and they produce outputs 104 with the same tag (FIG. 6A). For example, in these Figures, both input tokens 100 and 102, and result 104, have a THREAD-ID, t₀, and a WAVE-NUMBER, w₀.

In contrast, THREAD-COORDINATE fires when the data value of a token at its first input matches the THREAD-ID of a token at its second input. This condition is depicted in FIG. 6B, where the data value of the left input token and the thread value of the right input token are both t₁. THREAD-COORDINATE generates an output token with the THREAD-ID and WAVE-NUMBER from the first input and the data value from the second input. In FIG. 6B, this condition produces an output <t₀, w₀>.d. In essence, THREAD-COORDINATE passes the second input's value (d) to the thread of the first input (t₀). Since the two inputs come from different threads, this forces the receiving thread (t₀ in this case) to wait for a message from the sending thread (t₁) before continuing execution.
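
For illustration only, the C sketch below contrasts the ordinary matching rule with the THREAD-COORDINATE rule on the hypothetical token model used in the earlier sketches; the function names are assumptions, not ISA mnemonics:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t thread, wave; int64_t value; } token;

    /* Ordinary two-input instructions fire when both tags match exactly. */
    static int normal_match(token a, token b)
    {
        return a.thread == b.thread && a.wave == b.wave;
    }

    /* THREAD-COORDINATE fires when the data value of the token on its first
     * input equals the THREAD-ID of the token on its second input ... */
    static int tc_match(token first, token second)
    {
        return (uint32_t)first.value == second.thread;
    }

    /* ... and the result takes its tag from the first input and its data
     * value from the second input. */
    static token tc_result(token first, token second)
    {
        return (token){ .thread = first.thread, .wave = first.wave,
                        .value  = second.value };
    }

    int main(void)
    {
        token a = { 0, 3, 7 }, b = { 0, 3, 8 };
        printf("ordinary instruction can fire: %d\n", normal_match(a, b)); /* 1 */

        token acquire = { 0, 3, 1 };    /* <t0, w>.t1 : data value names thread t1 */
        token release = { 1, 8, 55 };   /* <t1, u>.d  : sent by thread t1          */
        if (tc_match(acquire, release)) {
            token out = tc_result(acquire, release);                       /* <t0, w>.d */
            printf("<t%u, w%u>.%lld\n", (unsigned)out.thread,
                   (unsigned)out.wave, (long long)out.value);
        }
        return 0;
    }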

Although it is possible to implement many kinds of synchronization objects using THREAD-COORDINATE, for brevity, an example 120 in FIG. 7 only illustrates how THREAD-COORDINATE is used to construct a mutex. In this case, THREAD-COORDINATE is the vehicle by which a thread releasing a mutex passes control to another thread wishing to acquire control of the mutex.

The mutex in FIG. 7 is represented by a THREAD-ID, t_(m), although it is not a thread in the usual sense; instead, t_(m)'s sole function is to uniquely name the mutex. A thread t₁ that has locked mutex t_(m) releases it in two steps (as shown on the right side of FIG. 7). First, t₁ ensures that the memory operations it executed inside the critical section have completed by executing a MEMORY-NOP-ACK 122. Then, t₁ uses a DATA-TO-THREAD instruction 124 to create the token <t_(m), u>.t_(m), which it sends to the second input port of THREAD-COORDINATE, thereby releasing the mutex.

This token waits at THREAD-COORDINATE's second input port until another thread, t₀ in the Figure, attempts to acquire the mutex. When this happens, t₀ sends a token <t₀, w>.tₘ (whose datum is the mutex) to THREAD-COORDINATE. By the rules discussed above, this token matches the one sent by t₁, causing THREAD-COORDINATE to produce a token <t₀, w>.tₘ. If all instructions in the critical section guarded by mutex tₘ depend on this output token (directly or via a chain of data dependencies), thread t₀ cannot execute the critical section until THREAD-COORDINATE produces it.
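By way of illustration only, the following C sketch models the matching rule of FIG. 6B and the mutex hand-off of FIG. 7 in software. The functions memory_nop_ack, data_to_thread, and thread_coordinate are hypothetical stand-ins for the WaveScalar instructions of the same names, and the token structure is a simplification, not the hardware encoding.

    #include <assert.h>
    #include <stdint.h>

    typedef struct { uint32_t thread_id, wave_number; uint64_t data; } token_t;

    /* Stand-in for MEMORY-NOP-ACK: the token it returns represents completion
       of the releasing thread's critical-section memory operations. */
    static token_t memory_nop_ack(uint32_t tid, uint32_t wave)
    {
        token_t t = { tid, wave, 0 };
        return t;
    }

    /* Stand-in for DATA-TO-THREAD: retags a token with the mutex's name tm;
       per FIG. 7 the released token is <tm, u>.tm, so the datum carries tm. */
    static token_t data_to_thread(token_t v, uint32_t tm)
    {
        v.thread_id = tm;
        v.data = tm;
        return v;
    }

    /* Stand-in for THREAD-COORDINATE: fires only when the first input's datum
       equals the second input's THREAD-ID (the rule of FIG. 6B); the output
       carries the first input's tag and the second input's data value. */
    static token_t thread_coordinate(token_t first, token_t second)
    {
        assert(first.data == second.thread_id);
        token_t out = { first.thread_id, first.wave_number, second.data };
        return out;
    }

    int main(void)
    {
        uint32_t tm = 7;                  /* the THREAD-ID naming the mutex */

        /* Thread t1 releases: fence, then retag the result with tm. */
        token_t unlock = data_to_thread(memory_nop_ack(1, 0), tm);

        /* Thread t0 acquires: its request token carries tm as its datum. */
        token_t request = { 0, 4, tm };
        token_t grant = thread_coordinate(request, unlock);

        /* Instructions in the critical section would depend on 'grant'. */
        return (int)grant.thread_id;
    }

In this sketch, the guarded code is made data-dependent on grant, mirroring the dependence chain described above.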

Splash-2

WaveScalar's multithreading facilities were evaluated by executing coarse-grain, multithreaded applications from the Splash-2 benchmark suite (Table 2). The toolchain and simulator described were used for this evaluation. An 8×8 array of clusters was simulated to model an aggressive, future-generation design. Using the results from the RTL model described above, but scaled to 45 nm, it is estimated that the processor occupies about 290 mm², with an on-chip 16 MB L2 cache.

TABLE 2
Splash-2 Benchmarks and Their Parameters Used in Study
Benchmark        Parameters
fft              -m12
lu               -n128
radix            -n16384 -r32
ocean-noncont    -n18
water-spatial    64 molecules

After skipping past initialization, execution of the parallel phases of the benchmarks was measured. The performance metric is execution-time speedup relative to a single thread executing on the same WaveCache. The WaveScalar speedups were also compared to those calculated by other researchers for other threaded architectures. Component metrics help explain these bottom-line results, where appropriate.

Evaluation of A Multithreaded WaveCache

FIG. 8 illustrates speedups of multithreaded WaveCaches for all six benchmarks, as compared to their single-threaded running time. On average, the WaveCache achieved near-linear speedup (27×) for up to 32 threads. Average performance still increases with 128 threads, but sublinearly, up to 47× speedup with an average IPC of 88.

Interestingly, increasing beyond 64 threads for ocean and raytrace reduces performance, because of WaveCache congestion from their larger instruction working sets and L1 data evictions due to capacity misses. For example, going from 64 to 128 threads, ocean suffered 18% more WaveCache instruction misses than would be expected from the additional compulsory misses. In addition, the matching cache (used to match operand values for execution) miss rate increased by 23%. Finally, the data cache miss rate, which is essentially constant for up to 32 threads, doubles as the number of threads scales to 128. This additional pressure on the memory system increases ocean's memory access latency by a factor of eleven.

The same factors that caused the performance of ocean and raytrace to suffer when the number of threads exceeded 64 also reduced the rate of speedup improvement for other applications as the number of threads increased. For example, the WaveCache instruction miss rate quadrupled for lu when the number of threads dedicated to the computation increased from 64 to 128, curbing speedup. In contrast, FFT, with its relatively small per-thread working set of instructions and data, did not tax these resources, and so achieved better speedup with up to 128 threads.

Comparison to Other Threaded Architectures

The performance of the WaveCache was compared with that of a few other architectures on three Splash-2 kernels: lu, fft, and radix. Results from several sources are presented, in addition to the WaveCache simulator results. For the CMP configurations, experiments were performed using a simple in-order core (scmp), and appropriate measurements were made. Comparing data from such diverse sources is difficult, and drawing precise conclusions about the results is hard; however, it is believed that the measurements are still valuable for the broad trends they reveal.

To make the comparison as equitable as possible, a smaller, 4×4 WaveCache was used for these studies. The RTL model gives an area of 253 mm² for this design (assuming an off-chip, 16 MB L2 cache, and increasing its access time from 10 to 20 cycles). While a precise area measurement was not available for the other architectures, the most aggressive configurations (i.e., most cores or functional units) are in the same ballpark with respect to size.

To facilitate the comparison of performance numbers from these different sources, all performance numbers were normalized to the performance of a simulated scalar processor with a 5-stage pipeline. The processor has 16 KB data and instruction caches, and a 1 MB L2 cache, all 4-way set associative. The L2 hit latency is 12 cycles, and the memory access latency of 200 cycles matches that of the WaveCache.

FIG. 9 shows the results of the evaluation. The stacked bars in this Figure represent the increase in performance contributed by executing with more threads. The bars labeled ws depict the performance of the WaveCache. The bars labeled scmp represent the performance of a CMP whose cores are the scalar processors described above. These processors are connected via a shared bus between private L1 caches and a shared L2 cache. Memory is sequentially consistent, and coherence is maintained by a 4-state snoopy protocol. Up to four accesses to the shared memory may overlap. For the CMPs, the stacked bars represent increased performance from simulating more processor cores. The 4- and 8-core stacked bars loosely model Hydra and a single Piranha chip, respectively.

The bars labeled smt8, cmp4, and cmp2 are the 8-threaded SMT and the 4- and 2-core out-of-order CMPs. Their running times were extracted from data provided by the authors. Memory latency is low on these systems (dozens of cycles) compared to expected future latencies, and all configurations share the L1 data and instruction caches.

To compare the results (labeled ekman in the Figure), which are normalized to the performance of their 2-core CMP, a superscalar with a configuration similar to one of these cores was simulated and the reported execution time was halved; this figure was then used as an estimate of absolute baseline performance. In the reference document, the authors fixed the execution resources for all configurations, and partitioned them among an increasing number of decreasingly wide CMP cores. For example, the 2-thread component of the ekman bars is the performance of a 2-core CMP in which each core has a fetch width of 8, while the 16-thread component represents the performance of 16 cores with a fetch width of 1. Latency to main memory is 384 cycles, and latency to the L2 cache is 12 cycles.

The graph shows that the WaveCache can handily outperform the other architectures at high thread counts. It executes 4.4× to 18× faster than scmp, 5.8× to 18× faster than smt8, and 10× to 20× faster than the various out-of-order CMP configurations. Component metrics show that the WaveCache's performance benefits arise from its use of point-to-point communication, rather than a system-wide broadcast mechanism, and from the latency tolerance of its dataflow execution model. The former enables scaling to large numbers of clusters and threads, while the latter helps mask the increased memory latency incurred by the directory protocol and the high load-use penalty on the L1 data cache.

The performance of all systems eventually plateaus when some bottleneck resource saturates. For scmp, this resource is shared L1 bus bandwidth; bus saturation occurs at 16 processors for LU, 8 for FFT, and 2 for RADIX. For the other von Neumann CMP systems, the fixed allocation of execution resources is the limit, resulting in a decrease in per-processor IPC. For example, in ekman, per-processor IPC drops 50% as the number of processors increases from 4 to 16 for RADIX and FFT. On the WaveCache, speedup plateaus when the working set of all the threads equals its instruction capacity, which gives the WaveCache the opportunity to tune the number of threads to the amount of on-chip resources. With their static partitioning of execution resources across processors, this option is absent for CMPs; and the monolithic nature of SMT architectures prevents scaling to large numbers of thread contexts.

Visual View of WaveCache Execution

Discussion

The WaveScalar architecture has been further extended to support multiple pthread-style threads by providing support for creating and destroying memory orderings and memoryless synchronization. The result is an efficient threading system that allows multiple coarse-grain threads to execute on a dataflow machine. The mechanisms are lightweight enough that programmers can also use them to express very fine-grain, loop-level parallelism.

Given the mechanisms described above, it is natural to think of a wave-ordered memory sequence as the essence of a thread, because in most systems the notion of a thread and its memory ordering are inseparable. But in WaveScalar, this perspective is misleading: nothing in the WaveScalar architecture requires a thread to have a memory ordering. If a thread could access memory without interacting with the wave-ordered memory system, it could avoid the serialization bottleneck that a global memory ordering requires. In the next section, an interface to memory is described that avoids the wave-ordered memory system; it is shown that, combined with fine-grain multithreading, WaveScalar can provide substantial benefits for applications where a global ordering of memory operations is not necessary for correct execution.

Fine-Grain, Unordered Threads

As discussed above, extensions have been provided to the WaveScalar instruction set that enable the WaveCache to execute multiple coarse-grain, pthread-style threads simultaneously. The keys to this were extending WaveScalar's tags with THREAD-IDs, providing lightweight memoryless synchronization primitives, and adding management instructions to start and stop ordered memory sequences. The ability to stop a memory ordering sequence raises the question, “What if a thread does not have an ordered memory interface at all?” Without an ordered memory interface, WaveScalar threads can execute their memory operations in any order, potentially exposing massive amounts of parallelism. Such threads are referred to herein as fine-grain, unordered threads.

The following section develops the notion of fine-grain, unordered threads, describes how they can coexist with the coarse-grain threads discussed above, and uses them to implement and evaluate three simple kernels. The fine-grain, unordered implementations are up to 9× faster than the coarse-grain threaded versions.

Unordered Memory

As described, WaveScalar's original instruction set allows a thread to execute without a memory ordering only if the thread does not access memory. These threads would be more useful if they could safely read and write the same memory used by threads that utilize wave-ordered memory. Then, the coarse-grain threads from the previous section and the new fine-grain, unordered threads could share data through memory.

WaveScalar has been provided with a new, unordered interface to memory. This interface does not require a thread to give up all control over the order in which memory instructions execute. Instead, it allows the thread to directly control which memory operations can fire in any order and which must be sequentialized.

To illustrate how WaveScalar accomplishes this, consider a store and a load that could potentially access the same address. If, for correct execution, the load must see the value written by the store (i.e., a read-after-write dependence), then the thread must ensure that the load does not execute until the store has finished. In threads that use wave-ordered memory, the store buffer enforces this constraint; however, since they bypass wave-ordered memory, unordered threads must have a different mechanism.

Dataflow instruction sets like WaveScalar ensure that one instruction executes after another by establishing a data dependence between them. (In the above example, this relationship means that the load instruction must be data-dependent on the store.)

For this technique to work, memory operations must produce an output token that can be passed to the operations that follow. Loads already do this, because they return a value from memory. However, stores are modified in the present approach to produce a value when they complete.

In addition, the unordered instructions do not carry wave-ordering annotations and bypass the store buffers, accessing the L1 data caches directly. To differentiate the unordered memory operations from their wave-ordered counterparts, two unordered operations, STORE-UNORDERED-ACK and LOAD-UNORDERED, are introduced.
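As a purely illustrative sketch (not WaveScalar assembly), the following C fragment models how a store can be chained to a dependent load through a produced value; store_unordered_ack and load_unordered are hypothetical software stand-ins for STORE-UNORDERED-ACK and LOAD-UNORDERED, and the token passed between them plays the role of the data dependence that orders the two accesses.

    #include <stdint.h>

    static uint64_t memory[1024];               /* toy memory for illustration */

    /* Stand-in for STORE-UNORDERED-ACK: performs the store and produces an
       acknowledgment token, so that later operations can be made
       data-dependent on the store's completion. */
    static uint64_t store_unordered_ack(uint64_t addr, uint64_t value)
    {
        memory[addr] = value;
        return 1;                               /* the acknowledgment token */
    }

    /* Stand-in for LOAD-UNORDERED: takes an extra token input solely so the
       load cannot fire until the producer of that token has executed. */
    static uint64_t load_unordered(uint64_t addr, uint64_t dependence_token)
    {
        (void)dependence_token;                 /* used only for ordering */
        return memory[addr];
    }

    int main(void)
    {
        uint64_t ack = store_unordered_ack(42, 7);   /* the store              */
        uint64_t v   = load_unordered(42, ack);      /* the RAW-dependent load */
        return (int)v;
    }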

Performance Evaluation

To demonstrate the potential of unordered memory in this context, three traditionally parallel but memory-intensive kernels, matrix multiply (MMUL), longest common subsequence (LCS), and a finite impulse response (FIR) filter, were implemented in three different styles, and their performance was compared. Serial coarse grain uses a single thread written in C. Parallel coarse grain is a coarse-grain parallelized version, also written in C, that uses the coarse-grain threading mechanisms described above. Unordered uses a single coarse-grain thread written in C to control a pool of fine-grain, unordered threads, written in WaveScalar assembly.

For each application, the number of threads and the array tile size were tuned to achieve the best performance possible for a particular implementation. MMUL multiplies 128×128 entry matrices, LCS compares strings of 1024 characters, and FIR filters 8192 inputs with 256 taps. Each version is run to completion.

FIGS. 10 and 11 depict the performance of each algorithm executing on the WaveCache. FIG. 10 shows speedup over the serial implementation, and FIG. 11 illustrates average units of work completed per cycle. For MMUL and FIR, the unit of work selected is a multiply-accumulate, while for LCS, it is a character comparison. Application-specific performance metrics were used for this comparison, because they are more informative than IPC when comparing the three implementations.

For all three kernels, the unordered implementations achieve superior performance by exploiting more parallelism. Using unordered memory eliminates false dependencies, enabling more memory operations to execute in parallel. In addition, bypassing the wave-ordering mechanisms reduces contention for limited store buffer resources. The consequence is a 32-1000× increase in the number of simultaneously executing threads.

As a result, the fine-grain implementation of MMUL completes 27 memory operations per cycle, as compared to 17 per cycle for the coarse-grain implementation.

Multigranular Threading

The extensions to WaveScalar that support coarse-grain, pthread-style threads were explained above. In the previous section, two lightweight memory instructions were introduced that enable fine-grain, unordered threads. In this section, these two models are combined. The result is a hybrid programming model that enables coarse- and fine-grain threads to coexist in the same application. Two examples that illustrate how ordered and unordered memory operations can be used together are discussed below. Then, the discussion indicates how all of the threading techniques are exploited to improve the performance of Spec2000's equake by a factor of nine.

Mixing Ordered and Unordered Memory

A key strength of the ordered and unordered memory mechanisms is their ability to coexist in the same application. Sections of an application that have independent and easily analyzable memory access patterns (e.g., matrix manipulations and stream processing) can use the unordered interface, while difficult-to-analyze portions (e.g., pointer-chasing codes) can use wave-ordered memory. The following takes a detailed look at how this feature is achieved.

Two embodiments are described to combine ordered and unordered memory accesses. The first turns off wave-ordered memory, uses the unordered interface, and then reinstates wave-ordering. The second, more flexible approach allows the ordered and unordered interfaces to exist simultaneously.

EXAMPLE 1

FIG. 12 shows a code sequence 140 that transitions from wave-ordered memory 142 to unordered memory 144 and back again to ordered memory 146. The process is quite similar to terminating and restarting a pthread-style thread. At the end of the ordered code, a THREAD-TO-DATA instruction extracts the current THREAD-ID, and a MEMORY-SEQUENCE-STOP instruction terminates the current memory ordering. MEMORY-SEQUENCE-STOP outputs a value, labeled finished in the figure, after all preceding wave-ordered memory operations have completed. The finished token triggers the dependent, unordered memory operations, ensuring that they do not execute until the earlier, ordered memory accesses have completed.

After the unordered portion has executed, a MEMORY-SEQUENCE-START creates a new, ordered memory sequence using the THREAD-ID extracted previously. In principle, the new thread need not have the same THREAD-ID as the original ordered thread. In practice, however, reusing the same THREAD-ID is convenient, because it allows values to flow directly from the first ordered section to the second (the curved arcs on the left side of the figure) without THREAD-ID manipulation instructions.
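The following C-style sketch summarizes the sequence of FIG. 12; thread_to_data, memory_sequence_stop, memory_sequence_start, and unordered_section are hypothetical stand-ins for the instructions and code regions named above, and the stub bodies only model the flow of tokens, not the instructions' hardware semantics.

    #include <stdint.h>

    typedef uint64_t token_t;

    /* Hypothetical stand-ins; bodies only model token production. */
    static token_t thread_to_data(void)               { return 1; /* THREAD-ID  */ }
    static token_t memory_sequence_stop(void)         { return 1; /* "finished" */ }
    static void    memory_sequence_start(token_t tid) { (void)tid; }
    static token_t unordered_section(token_t trigger) { return trigger; }

    int main(void)
    {
        /* Ordered section 142: extract the THREAD-ID, then stop the ordering. */
        token_t tid      = thread_to_data();
        token_t finished = memory_sequence_stop();   /* produced only after all
                                                        ordered accesses finish */

        /* Unordered section 144: triggered by the finished token. */
        token_t done = unordered_section(finished);
        (void)done;

        /* Ordered section 146: restart wave-ordering under the same THREAD-ID. */
        memory_sequence_start(tid);
        return 0;
    }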

EXAMPLE 2

In many cases, a compiler may be unable to determine the targets of some memory operations. The wave-ordered memory interface must remain intact to handle these hard-to-analyze accesses. Meanwhile, the analyzable operations can use unordered memory accesses that simply bypass the wave-ordering interface. This approach allows the two memory interfaces to coexist in the same thread.

FIG. 13 shows how MEMORY-NOP-ACK instructions enable programs to take advantage of this technique. In a function foo 154, the loads and stores that copy *v into t can execute in parallel, but must wait for the store to p, which could point to any address. Likewise, the load from address q cannot proceed until the copy is complete. The wave-ordered memory system guarantees that the store to p, two MEMORY-NOP-ACKs 150 and 152, and the load from q fire in the order shown (top to bottom). The data dependencies between the first MEMORY-NOP-ACK 150 and the unordered loads at the left of the Figure ensure that the copy occurs after the first store. An add instruction 156 simply coalesces the outputs from the two STORE-UNORDERED-ACK instructions 158 and 160 into a trigger for the second MEMORY-NOP-ACK, which ensures the copy is complete before the final load.
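For reference, one plausible C shape for the function foo is sketched below. The exact source is not given in the text, so the types, sizes, and variable names are assumptions chosen only to make the dependence pattern concrete.

    struct vec { int a, b; };

    void foo(struct vec *t, const struct vec *v, int *p, const int *q, int *r)
    {
        *p = 0;     /* store through p: may alias anything, so it remains under
                       wave-ordered memory                                      */
        *t = *v;    /* the copy of *v into t: its loads and stores may execute
                       in parallel as unordered operations, but only after the
                       store to p (gated by MEMORY-NOP-ACK 150)                 */
        *r = *q;    /* load from q: must wait until the copy completes (gated
                       by the second MEMORY-NOP-ACK 152)                        */
    }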

A Detailed Example: equake

To demonstrate that mixing the two threading styles is not only possible but also profitable, we optimized equake from the SPEC2000 benchmark suite. Equake spends most of its time in the function smvp, with the bulk of the remainder confined to a single loop in the program's main function. In the discussion below, this loop in main is referred to as sim.

Both ordered, coarse-grain and unordered, fine-grain threads are exploited in equake. The key loops in sim are data independent, so they are parallelized using coarse-grain threads that process a work queue of blocks of iterations. This optimization improves equake's overall performance by a factor of about 1.6.

Next, the unordered memory interface is used to exploit fine-grain parallelism in smvp. Two opportunities present themselves. First, each iteration of smvp's nested loops loads data from several arrays. Since these arrays are read-only, unordered loads are used to bypass wave-ordered memory, allowing loads from several iterations to execute in parallel. Second, the optimization targets a set of irregular cross-iteration dependencies in smvp's inner loop that are caused by updating an array of sums. These cross-iteration dependencies make it difficult to profitably coarse-grain-parallelize the loop. However, the THREAD-COORDINATE instruction lets fine-grain parallelism be extracted despite these dependencies, since it passes array elements from PE to PE and guarantees that only one thread can hold a particular value at a time. This idiom is inspired by M-structures, a dataflow-style memory element (P. S. Barth et al., “M-structures: extending a parallel, non-strict, functional language with state,” in Conference on Functional Programming Languages and Computer Architecture, 1991). Rewriting smvp with unordered memory and M-structures improves overall performance by a factor of 7.9.

When both coarse-grain and fine-grain threading are used together, equake speeds up by a factor of 9.0, which demonstrates that the coarse-grain, pthread-style threads can be used with fine-grain, unordered threads to accelerate a single application.

Exemplary Design Implementation

To explore WaveScalar's true area requirements and performance, a synthesizable, pipelined RTL model of the WaveScalar microarchitecture, called the WaveCache, was built. This model synthesizes with a Taiwan Semiconductor Manufacturing Company (TSMC) 90 nm standard cell process. It contains four major components: pipelined processing elements, a pipelined memory interface, a multi-hop network switch, and a distributed data cache. These pieces comprise the cluster, which is the basic unit of the WaveCache microarchitecture. Clusters are replicated across the silicon die to form the processing chip.

In the process of going from a paper design to a synthesizable RTL model, a large number of design options were explored to meet area, clock cycle, and instructions-per-clock performance targets. Where appropriate, results from the cycle-level simulator that illustrate the application performance trade-offs are discussed below.

By making the proper engineering trade-offs and developing innovations in the RTL implementation, it was shown that a high-performance WaveCache can be built in current-generation 90 nm process technology. The processor requires 252 mm² of silicon area. The tools that were used predicted a clock cycle of 20.3 FO4 for the execution core and 25 FO4 for the memory interface, leading to a final processor clock of 25 FO4. This clock rate was achieved by aggressively pipelining the microarchitecture. While this cycle time is longer than that of carefully tuned commercial desktop processors, it is faster than that of other prototypes typically created in academic settings that use similar tools and design flows.

Synthesizable Model

The synthesizable model that was used is written in Verilog. Synopsys DesignCompiler™ and DesignCompiler Ultra™ were used for logical synthesis. The model integrates several Verilog IP models for critical components, such as SRAM cells, arbiters, and functional units.

ASIC design flow: The design rules for manufacturing devices have undergone dramatic changes at and below the 130 nm technology node. Issues such as crosstalk, leakage current, and wire delay have required synthesis tool manufacturers to upgrade their infrastructures. The changes have also made it more difficult to draw reliable conclusions from scaling down designs done in larger processes. The data presented below were derived with the design rules and the recommended tool infrastructure of the TSMC Reference Flow 4.0 specification, which is tuned for 130 nm and smaller designs. By using these up-to-date specifications, it was ensured, as much as possible, that the results will scale to future technology nodes.

As noted by TSMC, designs at and below 130 nm are extremely sensitive to placement and routing. Therefore, TSMC recommends against using the delay numbers that are produced after logical synthesis. Instead, it is recommended that the generated netlist be input into Cadence Encounter™ for floor planning and placement, and that Cadence NanoRoute™ be employed for routing. These suggestions were followed. After routing and RC extraction, the timing and area values were recorded. When necessary, the design was fed back into DesignCompiler, along with the updated timing information, to recompile the design. The area values presented here include the overhead from incomplete core utilization.

Standard cell libraries: This design uses the standard cell libraries from the TSMC 90 nm process. The 90 nm process is the most current process available and hence represents the best target for extracting meaningful synthesis data. The cell libraries contain all of the logical components necessary for synthesis in both low-power and high-performance configurations. For this study, the high-performance cells were used exclusively for all parts of the design, although portions of the design that are not timing critical should later be reimplemented with the low-power cells to reduce power consumption.

The memory in the design is a mixture of SRAM memories generated from a commercial memory compiler, used for the large memory structures such as data caches, and Synopsys DesignWare™ IP memory building blocks, used for the other, smaller memory structures. The characteristics (size, delay, etc.) of the memory compiler have been explored by others.

Timing data: Architects prefer to evaluate clock cycle time in a process-independent metric, fan-out-of-four (FO4). The benefit of using this metric is that the cycle time in FO4 does not change (much) as the process changes. Thus, a more direct comparison of designs can be performed.

Synthesis tools, however, report delay in absolute terms (nanoseconds). To report timing data in FO4, the common academic practice of synthesizing a ring oscillator to measure FO1 and then multiplying this delay by 3 was followed. An oscillator was built using the same design flow and standard cells as used in the rest of the design, and an FO1 of 16.73 ps was measured, which results in an FO4 of 50.2 ps. All timing data presented herein are reported in FO4 based upon this measurement.
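As a purely illustrative conversion (not a reported measurement), the 25 FO4 processor clock noted above corresponds to a period of roughly 25 × 50.2 ps ≈ 1.26 ns, or a clock rate on the order of 800 MHz in this 90 nm process.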

Cycle-Level Functional Simulation

In connection with the Verilog RTL model, a corresponding cycle-accurate, instruction-level simulator was built. The simulator models each major subsystem of the WaveCache (execution, memory, and network) and has been used to explore many aspects of the design in more detail. It also serves to answer basic questions, such as the sizing of microarchitecture features and the performance impact of contention effects, that arise from the actual design. To drive the simulations, a suite of applications was executed, as described herein. These applications were compiled with the DEC Alpha CC compiler and then binary translated into WaveCache assembly. The assembly files were compiled with the WaveScalar assembler, and the resulting executables were used by the simulator.

Microarchitecture: From a programmer's perspective, every static instruction in a program binary has a dedicated PE. Clearly, building so many PEs is impractical and wasteful, so, in practice, multiple instructions are dynamically bound to a fixed number of PEs that are swapped in and out on demand. Thus, the PEs cache the working set of the application; hence, the microarchitecture that executes WaveScalar binaries is called a WaveCache. As discussed above, FIG. 3 illustrates how a WaveScalar program 30 on the left side of the Figure can be mapped into a WaveCache 32. The conflicting goals of the instruction mapping algorithm (which maps dynamically as the program executes) are to place dependent instructions near each other to minimize producer-consumer latency, and to spread independent instructions out in order to utilize resources and exploit parallelism.

Each PE 34 in FIG. 3 contains a functional unit, specialized memories to hold operands, and logic to control instruction execution and communication. Each PE also contains buffering and storage for several different static instructions. The PE has a five-stage pipeline, with bypass networks allowing back-to-back execution of dependent instructions at the same PE. Two aspects of the design warrant special notice. First, it avoids a large, centralized, associative tag matching store, found on some previous dataflow machines. Second, although PEs dynamically schedule execution, the scheduling hardware is dramatically simpler than that of a conventional dynamically scheduled processor. The PE design is described in detail below.

To reduce communication costs within the grid, PEs are organized hierarchically, as shown in FIG. 2 and as described above. PEs are coupled into pods; within a pod 65, PEs snoop each other's result networks and share scheduling information. These pods are further grouped into domains; within each domain, PEs communicate over a set of pipelined busses. Four domains form each cluster 52, which also contains wave-ordered memory hardware (in store buffers 58), a network switch 60, and an L1 data cache 62, as noted above.

The baseline design: The exemplary RTL-level model described herein is a 4×4 array of 16 clusters, each containing a total of 16 pods (32 PEs), arranged 4 per domain. In the 90 nm process, each cluster occupies 16 mm², yielding a 263 mm² WaveCache.

The next three sections describe the exemplary RTL model of a WaveCache processor comprising 16 clusters in a 4×4 grid, as noted above in connection with Table 1. During the design of this model, many design options were considered, and choices were made based on the effect they had on delay, area, and application performance.

Processing Elements (PEs)

The WaveCache contains the same overall structures as all computing devices, namely execution, interconnect, and memory resources. Its microarchitecture is presented using this organization, to give a context in which to view each type of resource. This section focuses on the execution resources.

The execution resources of the WaveCache comprise hundreds of pipelined PEs. The following discussion explains the microarchitecture of the PEs by first describing their function and providing a broad overview of their pipeline stages. An example is presented below that traces the execution of a short sequence of instructions through a pipeline. Following this example, each pipeline stage is described in detail.

A PE's Function

At a high level, the structure of a PE pipeline resembles a conventional five-stage, dynamically scheduled execution pipeline. The greatest difference between the two is that the PE's execution is entirely data-driven. Instead of executing instructions provided by a program counter, as would occur on von Neumann machines, values arrive at a PE destined for use by a particular instruction. These values trigger execution; this is the essence of dataflow execution. A pre-decoded instruction is fetched from a local instruction store in the PE and, when all instruction inputs are available, the instruction executes and sends its result to trigger the execution of other instructions.

The five pipeline stages of a PE are:

1. INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle.

2. MATCH: Operands enter the operand matching table. The matching table determines which instructions are now ready to fire, and issues eligible instructions by placing their matching table index into the instruction scheduling queue.

3. DISPATCH: The PE selects an instruction from the scheduling queue, reads its operands from the matching table, and forwards them to EXECUTE. If the destination of the dispatched instruction is local, this stage speculatively issues the consumer instruction to the scheduling queue.

4. EXECUTE: An instruction executes. Its result goes to the output queue and/or to the local bypass network.

5. OUTPUT: Instruction outputs are sent via the output bus to their consumer instructions, either at this PE or at a remote PE.

The pipeline design includes bypass paths that enable a result to flow from the end of EXECUTE directly back to the beginning of EXECUTE for a dependent instruction. This bypass network, combined with hardware scheduling, enables back-to-back execution of dependent instructions.

FIG. 14 illustrates how instructions from a simple dataflow graph 161 flow through a pipeline 162 and how their execution affects a matching table 164 and a scheduling queue 166. This Figure also illustrates how the bypass network allows two instructions A and B to execute on consecutive cycles. In this sequence, A's result is forwarded to B when B is in EXECUTE. In the diagram, X[n] is the nth input to instruction X. Five consecutive cycles 168a, 168b, 168c, 168d, and 168e are depicted; before the first of these, cycle 168a, one input each from instructions A and B has arrived and resides in matching table 164.

“Clouds” 170 in the dataflow graph represent results of instructions at other processing elements, which have arrived from the input network.

Cycle 0: Operand A[0] arrives, and INPUT accepts it.

Cycle 1: MATCH writes A[0] into the matching table and, because both of A's inputs are now available, places a pointer to A's entry in matching table 164 into scheduling queue 166.

Cycle 2: DISPATCH chooses A for execution, reads its operands, and sends them to EXECUTE. At the same time, it recognizes that A's output is destined for B; in preparation for this producer-consumer handoff, a pointer to B's matching table entry is inserted into the scheduling queue.

Cycle 3: DISPATCH reads B[0] from the matching table and sends it to EXECUTE. EXECUTE computes the result of A, which is B[1].

Cycle 4: EXECUTE computes the result of instruction B using B[0] and the result from the bypass network.

Cycle 5 (not shown): OUTPUT will send B's output to Z.

This example serves to illustrate the basic mechanics of PE operation. Each stage is described in detail next, along with the design trade-offs involved in each.

Input

At each cycle, INPUT monitors the incoming operand busses. In the exemplary RTL model, there are 10 busses: one is the PE's output bus, seven originate from the other PEs in the same domain, one is the network bus, and one is the memory interface. INPUT will accept inputs from up to four of these busses each cycle. If more than four arrive during one cycle, an arbiter selects among them; rejected inputs are retransmitted by their senders. Four inputs is a reasonable balance between performance and design complexity/area. Due to the banked nature of the matching table (see below), reducing the number of inputs to three was found to have no practical area-delay benefit. Two inputs, however, reduced application performance by 5% on average, and by 15-17% for some applications (ammp and fir). Doubling the number of inputs to eight increased performance by less than 1% on average.

As noted above, WaveScalar is a tagged token dataflow machine, which means that all data values carry a tag that differentiates dynamic instances of the same value. Tags in WaveScalar comprise two fields: a THREAD-ID and a WAVE-NUMBER. Since each PE can hold multiple static instructions, messages on the busses also carry a destination instruction number. INPUT computes a simple XOR hash of the THREAD-ID, WAVE-NUMBER, and destination instruction number for each operand, which is used to index the matching table. INPUT then places the (up to four) operands it has selected, along with their hashes, into its pipeline register for MATCH to process in the next clock cycle.
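A software analogue of this hash might look like the following sketch; only the XOR of the three fields is specified in the text, so the field widths and the reduction to a table index are assumptions.

    #include <stdint.h>

    #define MATCH_TABLE_LINES 16   /* matching table entries per PE (see below) */

    /* One plausible form of INPUT's hash: XOR the three fields together and
       reduce the result to a matching table index. */
    static unsigned match_index(uint32_t thread_id, uint32_t wave_number,
                                uint32_t dest_inst)
    {
        uint32_t h = thread_id ^ wave_number ^ dest_inst;
        return h % MATCH_TABLE_LINES;
    }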

Neglecting domain wiring overhead, which is accounted for below, INPUT's actual logic consumes 8.3% (0.03 mm²) of the PE's area. It achieves a clock cycle of 13.7 FO4 in isolation, which is significantly shorter than that of the other stages. However, the rest of the clock period is taken up by delay in the intra-domain interconnect.

Match

The next two pipeline stages comprise the operand tag matching and instruction dispatch logic. Implementing these operations cost-effectively is essential to an efficient dataflow design and has historically been an impediment to more effective dataflow execution. The key challenge in designing the WaveCache matching table is emulating a potentially infinite table with a much smaller physical structure. This problem arises because WaveScalar is a dynamic dataflow architecture and places no limit on the number of dynamic instances of a static instruction with unconsumed inputs.

To address this challenge, the matching table is a specialized cache for a larger, in-memory matching table, a common dataflow technique. MATCH writes operands into the matching table, and DISPATCH reads them out. The table is separated into three columns, one for each potential instruction input. Associated with the matching table is a tracker board, which holds the operand tag, the consumer instruction number, presence bits that denote which operands have arrived, and a pin bit that indicates which instructions have all of their operands and are ready to execute.

When new operands arrive from INPUT, the PE attempts to store each of them in the matching table, using the hash as the index. For each operand, there are four possibilities: (1) an operand with the same tag has already arrived, so there is a space waiting for it in the matching table; (2) no other operands with the same tag have arrived and the line is unoccupied, in which case MATCH allocates the line to the new operand and updates the tracker board; (3) the line is occupied by the operands for another instruction, in which case the PE rejects the message and waits for the sender to retry; after several retries, the operands resident in the matching table are evicted to memory, and the newly empty line is allocated to the new operand; and (4) the line is occupied by the operands for another instruction that is pinned to the matching table, which occurs when the instruction is ready to execute but has not yet executed; as in case (3), the message is rejected and will be resent, and after four retries the new operand is written to memory. Scenarios (3) and (4) are matching table misses.
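The four outcomes can be summarized with a small behavioral sketch (not the RTL); the structures, field layout, and retry handling are simplified for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    struct tag  { uint32_t thread_id, wave_number, inst; };

    struct line {
        bool       valid;        /* line allocated                         */
        bool       pinned;       /* ready to fire, awaiting dispatch       */
        struct tag tag;
        uint64_t   operand[3];
        uint8_t    present;      /* presence bits, one per operand         */
    };

    static bool same_tag(struct tag a, struct tag b)
    {
        return a.thread_id == b.thread_id && a.wave_number == b.wave_number &&
               a.inst == b.inst;
    }

    /* Returns true if the operand is accepted; false means the message is
       rejected and the sender must retry (cases (3) and (4) above). */
    static bool match_insert(struct line *l, struct tag t, int slot, uint64_t v)
    {
        if (l->valid && same_tag(l->tag, t)) {   /* case (1): tag already present */
            l->operand[slot] = v;
            l->present |= (uint8_t)(1u << slot);
            return true;
        }
        if (!l->valid) {                         /* case (2): allocate the line   */
            l->valid = true;
            l->tag = t;
            l->operand[slot] = v;
            l->present = (uint8_t)(1u << slot);
            return true;
        }
        return false;                            /* cases (3) and (4): miss       */
    }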

In parallel with updating the matching table, MATCH checks the presence bits to see if any of the operands that arrived in INPUT were the last ones needed to allow an instruction to execute. If this is the case, MATCH pins the corresponding line in place and adds its matching table index and tag to the scheduling queue (described below, in the next section).

While the average occupancy of the matching table is low, it was found to be critical to have at least 16 entries to handle bursty behavior. Reducing the number of entries to eight dropped performance by 23% on average. Doubling the number to 32 added almost no gain for applications written in C, but increased performance on fine-grained dataflow kernels by 36%. Because this configuration consumes substantially more area (see below) and provides limited improvement on C-based applications, 16 entries were chosen for the exemplary RTL implementation. This application-class variation, however, suggests that designers will want to tune this parameter, depending on the target market.

Since the matching table is a cache, traditional caching optimizations, such as associativity and banking, are employed to reduce area requirements, miss rate, and miss penalty. The basic design is two-way set associative, and each way is banked by four to increase read/write port availability. Given three operand queues 180a, 180b, and 180c, and a tracker board 182 for each of four banks 184a, 184b, 184c, and 184d, the entire design requires 32 small static random access memories (SRAMs) (four tables, two sets/table, four banks/set, as shown at the bottom of each tracker board in FIG. 15), each of which contains two matching table entries. The SRAMs for the first two operands are 64 bits wide. The third operand is used only for single-bit operands (control bits), so its SRAMs are one bit wide. The tracker board SRAMs are 76 bits wide.

The figure shows the data paths between the SRAMs in detail. Operand messages 186 from INPUT arrive at the top. Data values 188 flow down to the operand arrays, while tag information 190 travels to the tracker board. Comparators 192a, 192b, 192c, and 192d for the corresponding banks determine whether a line has previously been allocated to an operand; the hash value, operand select, and tracker board pick the line, bank, and “way” where the operands should reside. Bank conflicts in the matching table are handled by rejecting the input.

RTL synthesis of MATCH shows that, in isolation, the MATCH hardware consumes 0.1 mm², or 29.8% of total PE area, and achieves a clock cycle of 20.3 FO4. Doubling the input queue size gives a near-linear increase in area (0.17 mm², or 39% of the PE), a 20% increase in overall PE size, and a 5% increase in delay. MATCH and DISPATCH are the longest stages in the PE, so increases in queue size should be considered with care.

Dispatch

The DISPATCH stage and a fire control unit (FCU) 194 (shown in FIGS. 15 and 16) are in charge of scheduling instructions for execution. In the simplest dispatching case, the FCU removes the first entry from scheduling queues 196 (FIG. 16), reads the corresponding line from matching table 164, and passes the operands to EXECUTE for execution. This behavior is sufficient for correct execution, but does not allow dependent instructions to execute on consecutive clock cycles.

To achieve back-to-back execution of dependent instructions, bypassing paths are provided that send results from the end of EXECUTE directly back to the beginning of EXECUTE. In addition, the FCU can speculatively issue a consumer of the result, readying it to use the newly produced result on the next cycle. In particular, when the FCU selects an entry from the scheduling queue, it accesses the instruction store to determine which, if any, of the instruction's consumers reside at the same PE. If there is a local consumer, the FCU computes the index of its line in the matching table and inserts it into a special scheduling queue, called a speculative fire queue 198.

Placing a consumer instruction in the speculative fire queue is a speculative act, because the FCU cannot tell whether the producer's result will allow it to fire (i.e., whether the instruction's other operands already reside in the matching table). In the example in FIG. 14, although the FCU knows that A will produce operand B[1], it does not know if B's second input, B[0], is present in the matching table. Operand availability is resolved in EXECUTE, where the speculative instruction's tag from the matching table (the unknown operand, sent to EXECUTE when the consumer is dispatched) is compared to the tag of the producer's result (the known operand, just computed). If they match, and if the presence bits match the required operand signature bits, the consumer instruction executes successfully, and the matching table entry is cleared. If not, the result is squashed, and the matching cache entry is left unchanged.
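A behavioral sketch of this resolution step is given below; the operand signature is modeled as a bit mask, and the structure and function names are illustrative rather than taken from the RTL.

    #include <stdbool.h>
    #include <stdint.h>

    struct op_tag { uint32_t thread_id, wave_number; };

    static bool tags_match(struct op_tag a, struct op_tag b)
    {
        return a.thread_id == b.thread_id && a.wave_number == b.wave_number;
    }

    /* Returns true if the speculatively dispatched consumer may execute: the
       producer's result tag must match the tag read from the matching table,
       and the presence bits plus the newly produced operand must cover the
       operand signature the instruction requires. Otherwise the result is
       squashed and the matching table entry is left unchanged. */
    static bool resolve_speculative_fire(struct op_tag producer_result_tag,
                                         struct op_tag matching_table_tag,
                                         uint8_t present_bits,
                                         uint8_t produced_operand_bit,
                                         uint8_t required_signature)
    {
        if (!tags_match(producer_result_tag, matching_table_tag))
            return false;
        return ((present_bits | produced_operand_bit) & required_signature)
               == required_signature;
    }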

DISPATCH gives higher priority to keeping a PE busy than to dispatching dependent instructions back-to-back. Therefore, it will usually choose to execute nonspeculative instructions over speculative ones. In particular, if there are enough nonspeculative instructions in the scheduling queue to allow a producer's result to flow from OUTPUT back to MATCH (where it will be placed in the matching table, and the matching table logic will determine whether the consumer should fire), DISPATCH will choose the nonspeculative instructions. Otherwise, it will gamble that all of the consumer's operands have arrived and dispatch it.

The scheduling queue size is 16 entries, chosen to be equivalent to the matching table, thus simplifying the design. The speculatively scheduled queue slot is maintained in a separate register.

The final piece of the FCU is an Instruction Control Unit (ICU) 200 (FIG. 16), which contains the PE's decoded static instructions, their opcodes, the consumers of their results, and immediate values. The ICU in the RTL design holds 64 decoded static instructions, each 59 bits wide. Decreasing the number of instructions to 32 reduces performance by 23% on average; doubling it to 128 increases performance by only 3%, but also increases ICU area by 120%, PE area by 55%, and cycle time by 4%. Nevertheless, these results indicate that designers of small WaveCaches (one or a small number of clusters) should choose the larger design.

DISPATCH shares a large portion of its logic with MATCH. The separate hardware includes the ICU, the scheduling queue, and the control logic. These added components require 0.17 mm² (49% of the PE area), nearly all of which is in the ICU. DISPATCH has the same delay as MATCH (20.3 FO4).

Execute

FIG. 17 illustrates exemplary functional components 208 for EXECUTE. These components include 3:1 multiplexers 210 and 212, an arithmetic logic unit (ALU) 214, and an instruction control unit 216. EXECUTE handles three different execution scenarios: (1) in the usual case, all operands are available and the output queue can accept a result; the instruction is executed, the result is written to the output queue, and the line in the matching table is unpinned and invalidated; (2) a speculative instruction, some of whose inputs are missing, was dispatched; in this case, the result is squashed, and the matching table line is unpinned but not invalidated; and (3) no space exists in the output queue; in this case, EXECUTE stalls until space is available.

In addition to a conventional functional unit, EXECUTE contains a tag-manipulation unit (not separately shown) that implements WaveScalar's tag manipulation instructions, and logic for handling its data steering instructions. PEs are non-uniform. In a current exemplary design, six PEs compute integer instructions only; these require 0.02 mm² (5.7% of the PE). Two PEs per domain contain a floating point unit (FPU) in addition to the integer core. These FPU-enabled PEs require an additional 0.12 mm².

Output

OUTPUT sends a result from the ALU to the consumer instructions that require it. FIG. 18 shows functional components 220 in one exemplary design for OUTPUT. OUTPUT contains a four-entry output queue 222 that is connected directly to the PE's output buffer. Also included are a reject buffer 224, a reject message modifier 226, and a local router 228. Values can enter the output queue either from EXECUTE or from the reject buffer (explained below). If the output queue is empty, incoming values go directly to an output buffer 230. The precise size of the output queue has little effect on performance; four entries are sufficient. The reason it tends not to influence performance is that result values normally flow uninterrupted to their destination PEs. The output buffer broadcasts the value on the PE's broadcast bus. In the common case, the consumer PE within that domain accepts the value immediately. It is possible, however, that the consumer cannot handle the value that cycle and will reject it. ACK/NACK signals require four cycles for the round trip. Rather than have the data value occupy the output buffer for that period, the PE assumes it will be accepted, moving it into the four-entry reject buffer, and inserts a new value into the output buffer on the next cycle. If an operand ends up being rejected, it is fed back into the output queue to be sent again to the destinations that rejected it. If all the receivers accept the message, the reject buffer discards the value. When rejected messages are going from the reject buffer to the output queue, any message from the execution unit bypasses the output queue, to avoid queuing two messages on the same cycle, as described in detail below.

Each instruction has its consumer instruction locations stored in the instruction cache. The destinations can be either to memory or to up to two other PEs. Each destination has a valid bit that is cleared whenever the destination PE accepts the message, which can happen either through the standard output network or when PEs in the same pod successfully execute a speculatively scheduled instruction. The output queue stops sending the message when all destination bits are clear.

Since there is no determined length of time that an entry can sit in the matching cache, there must be a mechanism for preventing messages from cycling through the reject buffer enough times to affect the sender's performance. To handle this condition, the sender keeps a two-bit counter of the number of times that the message has been rejected. When this counter reaches its maximum value, the sender requests that the receiver forcefully accept the message. When the receiver gets a forced accept request, it rejects the message, but places the entry that is blocking the message into the scheduling queue to fire. Instead of firing normally, the entry is sent through the pipeline without modifications to its tag or data. This entry then travels through the pipeline in the standard manner, but instead of going to its destination, it goes to the memory pseudo-PE with a special flag to indicate that the message should be sent back later. The memory pseudo-PE holds a table of entries that have been sent to the L1 cache and need to be resent to the domain, and retrieves those entries later. In the special case that two operands are stalled, the fire control unit sends each operand in a separate message. This mechanism requires very little extra logic to implement, and guarantees that each message will eventually make it to the receiver.
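The sender-side retry policy can be summarized with the following sketch; the two-bit counter (and therefore the saturation value of 3) follows the text, while the message structure itself is illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct out_msg {
        uint8_t reject_count;     /* two-bit counter: saturates at 3          */
        bool    force_accept;     /* ask the receiver to forcefully accept    */
    };

    /* Called each time the receiver NACKs the message. Once the counter
       saturates, the next retry carries a forced accept request, which causes
       the receiver to evict the blocking matching table entry via the memory
       pseudo-PE, as described above. */
    static void on_reject(struct out_msg *m)
    {
        if (m->reject_count < 3)
            m->reject_count++;
        m->force_accept = (m->reject_count == 3);
    }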

The output stage consumes 9% of the PE's area. It achieves a clock cycle of 17 FO4 in isolation, and the remainder of its clock cycle is devoted to wire delay in the intra-domain interconnect.

PE Area and Timing

In total, each PE consumes 0.36 mm²; all together, the PEs comprise 87% of total chip area. The matching table stage in the PE is the critical path (20.3 FO4) for both the PE and the domain. Within MATCH, the longest path is the one that updates the scheduling queue. This path depends on a read/compare of the matching table.

In addition to the eight PEs, each domain contains two pseudo-PEs (called MEM and NET) that serve as portals to the memory system and to PEs in other domains and other clusters. Each pseudo-PE contains buffering for 64 messages. The NET and MEM pseudo-PEs are 0.08 mm² and 0.06 mm², respectively.

An entire domain occupies 3.6 mm². In order to estimate the area of the domain exclusive of its intra-domain interconnect (described in the next section), the PEs were synthesized in isolation, and their areas were compared to the total domain area after processing with Cadence Encounter™. Using this estimate, it was found that the domain interconnect accounts for 8.6% of the total domain size.

The Network

The preceding section describes the execution resource of the WaveCache, i.e., the PE. This section provides details about how PEs on the same chip communicate. PEs send and receive data using a hierarchical, on-chip interconnect system 240, which is shown in FIG. 19. There are four levels in this hierarchy: intra-pod 242, intra-domain 244, intra-cluster 246, and inter-cluster. The first three of these networks are illustrated in FIG. 19, which depicts exemplary details of a single cluster 52 (shown in FIG. 2). The fourth network, the inter-cluster network, is a dynamically routed packet network that connects the clusters using a switch 248. While the purpose of each network is the same, i.e., transmission of instruction operands and memory values, the design varies significantly across them. Salient features of these networks are described below in the next four sections.

PEs in A Pod

The first level of interconnect, intra-pod interconnect 242, enables two PEs 64 to share their bypass networks and scheduling information. Merging a pair of PEs into a pod 65 provides lower latency communication between them than using intra-domain interconnect 244 (see below).

While PEs in a pod snoop each other's bypass networks, all other aspects of a PE remain partitioned: separate matching tables, scheduling and output queues, etc. The intra-pod network transmits data from the execution units, and transmits instruction scheduling information from the fire control units.

Currently, the exemplary RTL model is implemented with two PEs per pod. The simulations show that this design is 5% faster on average than PEs in isolation, and up to 15% faster for vpr and ammp. Increasing the number of PEs in each pod would further increase IPC, but since DISPATCH is already the longest stage in the PE, it would have a detrimental effect on cycle time.

The Intra-Domain Interconnect

PEs communicate over intra-domain interconnect 244, shown in detail in FIG. 20. Its interface to both PEs and pseudo-PEs is identical. The intra-domain interconnect is broadcast-based. Each of the eight PEs has a dedicated 164-bit result bus that carries a single data result to the other PEs in its domain. Each pseudo-PE also has a dedicated 164-bit output bus. PEs and pseudo-PEs communicate over the intra-domain network using a garden-variety ACK/NACK network. The timing of this network in this design is illustrated by an exemplary timing diagram 250, shown in FIG. 21. In this example, PE0 is trying to send D0 to PE1 and PE2, and D1 to PE1.

Cycle 0: PE0 sends D0 to PE1 and PE2. The OUTPUT stage at PE0 prepares the message and broadcasts it, asserting the PE1 and PE2 receive lines.

Cycle 1: PE0 sends D1 to PE1, which reasserts its receive line. At PE1 and PE2, INPUT processes D0 and sends it to MATCH.

Cycle 2: PE0 goes idle. INPUT at PE1 receives D1 and sends it to MATCH. MATCH at PE2 detects a matching table conflict for D0 and asserts the NACK signal. PE1 does not have a conflict and, by not asserting NACK, accepts the message.

Cycle 3: The interconnect delay.

Cycle 4: PE0 receives the NACK signal from PE2 for D0.

Cycle 5: PE0 notes that PE1 accepted D1 and attempts to retry sending D0 to PE1.

There are two advantages of using ACK/NACK flow control for this network. The first is a large reduction in area. There are ten inputs 252 to each PE, and adding a two-entry buffer to each input would require 2868 bits of buffering at each receiver. Instead, only 169 bits of buffering are used at the sender in this exemplary design. Second, ACK/NACK flow control allows messages to bypass the rejected messages. The consequences of these advantages are a lower clock rate and sustained bandwidth.

The downside, however, is that rejected messages take far longer to process. In our experiments, we found that, on average, fewer than 1% of messages were rejected. As there is only one ALU per PE, provisioning the network to send more than one result per cycle is useful only for processing these relatively few rejected messages. Widening the PE broadcast busses to transmit two results increased performance negligibly and significantly increased the complexity of the PEs' input and output interfaces.

The Intra-Cluster Interconnect

The intra-cluster interconnect provides communication between the four domains' NET pseudo-PEs. It also uses an ACK/NACK network similar to that of the intra-domain interconnect, with some additional buffering. An extra pipeline stage is added to the network to account for wire delay. The pseudo-PEs occupy only 8% of the domain area. Synthesized in isolation, they pass timing at 20 FO4, with considerable slack to spare (i.e., they can be clocked faster).

The Inter-Cluster Interconnect

The inter-cluster interconnect is responsible for all long-distance communication in the WaveCache, which includes operands traveling between PEs in distant clusters and coherence traffic for the L1 caches.

Each cluster contains an inter-cluster network switch, each of which routes messages between six input/output ports. Four of the ports lead to the network switches in the four cardinal directions, one is shared among the four domains' NET pseudo-PEs, and one is dedicated to the store buffer and L1 data cache.

The inter-cluster interconnect uses a simple dynamic routing switch. Each switch has six input/output ports, each of which supports the transmission of up to two operands. Its routing follows a simple protocol: the current buffer storage state at each switch is sent to the adjacent switches, which receive this information a clock cycle later. Adjacent switches only send information if the receiver is guaranteed to have space.

The inter-cluster switch provides two virtual channels that the interconnect uses to prevent deadlock. FIG. 22 shows the details of one input/output port 260 of the inter-cluster switch. Input/output port 260 includes an input arbiter 262, which controls an input select multiplexer 264 and a reject control 266. A selected input is applied to a demultiplexer 270. A channel select 268 controls demultiplexer 270 and also a multiplexer 272, which determines the output of a queue 274 that is input to reject control 266. A channel arbiter 276 controls a multiplexer 278 to control the output from queue 274 that is applied to a South block 280. The output of South block 280 is provided as an input to the ports via a data line 282 and to a routing block 284. Each output port thus includes two 8-entry output queues (one for each virtual network). In some cases, a message may have two possible destinations (i.e., North and West if its ultimate destination is to the northwest). In these cases, routing block 284 randomly selects which way to route the message.

The network carries messages that are 164 bits wide and include a destination location in the grid. In each message, 64 bits are used for data and 64 bits for tag; the additional bits are for routing. The destination routing includes the following elements: destination cluster x and y (four bits each), destination domain (two bits), destination PE (three bits), destination virtual slot number (six bits), and destination operand number (two bits). Memory messages are also routed over this network and share routing bits with those used for sending operands. Memory messages are routed with the cluster position, sequence tag information (15 bits), and store buffer number (two bits).
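
For concreteness, the destination-routing fields listed above can be packed into a single word. The field widths below come directly from the text; the packing order and helper names are assumptions made purely for illustration.

    FIELDS = [                 # (field name, width in bits)
        ("cluster_x",    4),
        ("cluster_y",    4),
        ("domain",       2),
        ("pe",           3),
        ("virtual_slot", 6),
        ("operand",      2),
    ]                          # 21 routing bits in total for an operand message

    def pack_route(values):
        word = 0
        for name, width in FIELDS:
            v = values[name]
            assert 0 <= v < (1 << width), f"{name} out of range"
            word = (word << width) | v
        return word

    route = pack_route({"cluster_x": 3, "cluster_y": 1, "domain": 2,
                        "pe": 5, "virtual_slot": 17, "operand": 1})
    print(f"{route:021b}")     # the 21 routing bits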

In the TSMC process there are nine metal layers available, which means the long-distance inter-cluster wires sit above the main cluster logic, minimizing the area impact of the switches. Each cluster switch requires 0.34 mm² and achieves a clock cycle of 19.9 FO4. In aggregate, the network switches account for 2% of the entire die.

Network Traffic

One goal of the WaveCache interconnect is to isolate as much traffic as possible in the lower layers of the hierarchy (e.g., within a PE, a pod, or a domain), and to rely on the upper levels only when absolutely necessary. FIG. 23 illustrates a graph 290 that shows the division of traffic among different layers of the hierarchy. On average, 28% of network traffic travels from a PE to itself or to the other PE in its pod, 48% of traffic remains within a domain, and only 2.2% needs to traverse the inter-cluster interconnect. Fine-grain applications require more inter-cluster traffic (33% of operands), which reflects the increased synchronization overhead required for fine-grain threading.

The graph also shows the division between operand data and memory/coherence traffic. Memory traffic accounts for 12% of messages on average. For the Spec2000 applications, less than 1% of those messages leave the cluster, because the instruction working set for each of these applications fits within a single cluster. Data sharing in the Splash-2 benchmarks increases inter-cluster memory traffic to 17% of memory traffic, but still only 0.4% of total network traffic; everything else is local.

These results demonstrate the scalability of communication performance on the WaveCache. Applications that require only a small patch of the WaveCache, such as Spec, can execute without ever paying the price for long-distance communication.

Waves and Wave-ordered Memory

The hardware support for wave-ordered memory lies in the WaveCache's store buffers. Waves and wave-ordered memory enable WaveScalar to execute programs written in imperative languages, such as C, C++, or Java, by providing the well-ordered memory semantics these languages require. WaveScalar is a tagged-token dataflow machine. It supports execution of applications written in mainstream imperative languages through the use of a special memory interface, wave-ordered memory. The key difference in implementing the hardware for this interface, as compared to a conventional store buffer or load/store queue, is the order in which memory operations fire. Instead of being sequenced by an instruction fetch mechanism, the firing order is under direct program control. A brief review is provided at this point to give context for the microarchitectural design.

When compiling a WaveScalar program, a compiler breaks its control flow graph into pieces called waves. The key properties of a wave are: (1) its instructions are partially ordered (i.e., it contains no back-edges); and (2) control enters at a single point. The compiler uses the control flow graph and the instruction order within basic blocks to annotate each memory operation with (1) its position in its wave, called a sequence number, and (2) its execution order (predecessor and successor) relative to other memory operations in the same wave, if they are known. Otherwise, they are labeled with ‘?’.
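
A simple data structure makes these annotations concrete. The sketch below is illustrative; the class name, field names, and the use of None to stand in for ‘?’ are assumptions, not part of the disclosed encoding.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MemOp:
        seq: int               # position within the wave (sequence number)
        pred: Optional[int]    # sequence number of the predecessor, or None for '?'
        succ: Optional[int]    # sequence number of the successor, or None for '?'
        kind: str = "load"     # "load" or "store"

    # A wave with a branch: op 1 executes on one path only, so op 0's successor
    # and op 2's predecessor cannot be determined statically.
    wave = [MemOp(0, None, None, "store"),
            MemOp(1, 0, 2, "load"),
            MemOp(2, None, None, "store")]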

During program execution, the memory system (in this exemplary implementation, a store buffer) uses these annotations to assemble a wave's loads and stores in the correct order. FIG. 1, which was discussed above, shows how the wave-ordering annotations enable the store buffer to order memory operations and detect those that are missing.

Store Buffers

The store buffers, one per cluster, are responsible for implementing the wave-ordered memory interface that guarantees correct memory ordering. To access memory, processing elements send requests to their local store buffer via the MEM pseudo-PE in their domain. The store buffer will either process the request or direct it to another buffer via the inter-cluster interconnect. All memory requests for a single dynamic instance of a wave (for example, an iteration of an inner loop), including requests from both local and remote processing elements, are managed by the same store buffer.

To simplify the description of the store buffer's operation, R.pred, R.seq, and R.succ denote the wave-ordering annotations for a request R. Also, next(R) is defined to be the sequence number of the operation that actually follows R in the current instance of the wave. The value of next(R) is determined either directly from R.succ or is calculated by the wave-ordered memory hardware, if R.succ is ‘?’.
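
A hedged sketch of how next(R) might be resolved is shown below; the dictionary-based next table is an assumption made for illustration. When R.succ is known statically it is used directly; otherwise the link recorded by the hardware (when R's successor arrives and names R as its predecessor, as described for MEMORY-INPUT below) is consulted.

    def next_of(seq, succ, next_table):
        if succ is not None:             # R.succ statically known (not '?')
            return succ
        return next_table.get(seq)       # filled in by the ordering hardware, or None

    next_table = {4: 7}                  # op 7 arrived and declared op 4 as its predecessor
    print(next_of(4, None, next_table))  # 7
    print(next_of(4, 9, next_table))     # 9 (the static link takes precedence)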

The store buffer contains four major microarchitectural components: an ordering table, a next table, an issued register, and a collection of partial store queues. Store buffer requests are processed in three pipeline stages: MEMORY-INPUT writes newly arrived requests into the ordering and next tables; MEMORY-SCHEDULE reads up to four requests from the ordering table and checks to see if they are ready to issue; MEMORY-OUTPUT dispatches memory operations that can fire to the cache or to a partial store queue (described below). Each pipeline stage of this memory interface is described in detail below.

MEMORY-INPUT accepts up to four new memory requests per cycle. For each memory request, it writes its address, operation, and datum (if available, for stores) into the ordering table at the index R.seq. If R.succ is defined (i.e., is not ‘?’), the entry in the next table at location R.seq is updated to R.succ. If R.pred is defined, the entry in the next table at location R.pred is set to R.seq.

MEMORY-SCHEDULE maintains the issued register, which points to the next memory operation to be dispatched to the data cache. It uses this register to read four entries from the next and ordering tables. If any memory ordering links can be formed (i.e., the next table entries are not empty), the memory operations are dispatched to MEMORY-OUTPUT and the issued register is advanced. The store buffer supports the decoupling of store-data from store-addresses, which is done with a hardware structure called a partial store queue, as described below. The salient point for MEMORY-SCHEDULE, however, is that stores are sent to MEMORY-OUTPUT even if their data have not yet arrived.

MEMORY-OUTPUT reads and processes dispatched memory operations. Four situations can occur: (1) the operation is a load or a store with its datum, and it is sent to the data cache; (2) the operation is a load or a store and a partial store queue exists for its address; the memory operation is sent to the partial store queue; (3) the memory operation is a store, its datum has not yet arrived, and no partial store queue exists for its address; in this case, a free partial store queue is allocated and the store is sent to it; and (4) the operation is a load or a store, but no free partial store queue is available or the partial store queue is full; the operation is discarded and the issued register is rolled back.
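
The interplay of the ordering table, next table, and issued register in MEMORY-INPUT and MEMORY-SCHEDULE can be sketched in software. The following Python model is an illustration under stated assumptions (dictionary-based tables, a single wave, one entry processed at a time rather than four per cycle), not the banked-SRAM hardware described above.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Req:
        seq: int
        pred: Optional[int] = None
        succ: Optional[int] = None
        dispatched: bool = False

    class StoreBufferCore:
        def __init__(self):
            self.ordering = {}     # seq -> arrived request (address, opcode, datum)
            self.next_tab = {}     # seq -> sequence number of the following operation
            self.issued = 0        # sequence number of the next op to dispatch

        def memory_input(self, req):
            self.ordering[req.seq] = req
            if req.succ is not None:
                self.next_tab[req.seq] = req.succ
            if req.pred is not None:
                self.next_tab[req.pred] = req.seq   # arriving op names its predecessor

        def memory_schedule(self):
            dispatched = []
            while True:
                req = self.ordering.get(self.issued)
                if req is None:
                    break                            # op at the issued pointer has not arrived
                if not req.dispatched:
                    req.dispatched = True
                    dispatched.append(req)           # hand off to MEMORY-OUTPUT
                nxt = self.next_tab.get(self.issued)
                if nxt is None:
                    break                            # successor link not yet known: hold here
                self.issued = nxt
            return dispatched

    sb = StoreBufferCore()
    sb.memory_input(Req(seq=0, succ=1))
    print([r.seq for r in sb.memory_schedule()])     # [0]
    sb.memory_input(Req(seq=1, pred=0, succ=None))   # successor is '?': wait for it
    print([r.seq for r in sb.memory_schedule()])     # [1]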

FIG. 24 illustrates exemplary store buffer logic and structures 300 that are needed to order a single wave of memory requests. An ordering table 302 has 32 entries received from an input arbitration block 304; it is composed of four banks that are interleaved to allow four consecutive entries to be read or written each cycle. The ordering table is 130 bits wide, large enough to hold an address and the memory request opcode. A next request table 306 has 32 entries, is five bits wide, and tracks next() information for the wave.

In this exemplary design, each store buffer contains two partial store queues 308a and 308b, each of which can hold four memory requests. Each partial store queue has one read port 310 and one write port 312. In addition, a two-entry associative table 314 detects whether an issued memory operation should be written to one of the partial store queues or be sent to data cache 316. Doubling the number of partial store queues increases performance by only 9% on average, while halving the number reduces it by 5%.

Each store buffer requires 0.6 mm² to implement. Four of these store buffers occupy 2.4 mm² per cluster. Of this, the partial store queues occupy 0.02 mm². This design achieves a clock speed of 25 FO4. It is the slowest of any component in the design and sets the clock rate for the device.

Caching and Coherence

The rest of the WaveCache's memory hierarchy comprises a 32 KB, four-way set associative L1 data cache at each cluster, and a 16 MB L2 cache distributed along the edge of the chip (16 banks in a 4×4 WaveCache). A directory-based, multiple-reader, single-writer coherence protocol keeps the L1 caches consistent. All coherence traffic travels over the inter-cluster interconnect.

Larger and smaller caches have been explored. The data are largely commensurate with what one observes in traditional microprocessors; a larger cache helps performance. The baseline design, with 32 KB of cache, requires 0.4 mm² per cluster to implement. It is contemplated that hardware designers will choose an appropriate cache size, depending upon their application mix and area constraints, as they do with all processors.

The L1 data cache has a three-cycle hit delay (two cycles of SRAM access, one cycle of processing), which can be overlapped with the store buffer processing for loads. The L2's hit delay is 14-30 cycles, depending upon the address and the distance to the requesting cluster. Main memory latency is modeled at 1000 cycles. In the exemplary RTL model, the L1 caches occupy 4% of the cluster's area. It is assumed that the L2 is off-chip. Doubling the size of the L1 data caches improves performance by only 3%. Additional cycle delays of larger caches begin to appear only at the 128 KB (one additional cycle) and 256 KB (two additional cycles) sizes. Shrinking the data cache to 16 KB has a negligible effect on access time.

Scaling the WaveCache

The previous three sections described the microarchitectural implementation of the WaveCache. The exemplary RTL was tuned and built around the “baseline” WaveCache above. That design requires 252 mm² in a 90 nm process. As with all processors, however, its core memory sizes and bandwidths can be tuned for different market segments. Two ends of the design space are briefly described. The performance and configuration of these alternatives are depicted in a bar graph 320 in FIG. 25. For each WaveCache size, the application was run with the optimal number of threads for that number of clusters.

At the low end is an 18 mm² single-cluster WaveCache. The results show that this small WaveCache achieves essentially the same performance on single-threaded C applications as the full-size WaveCache. Obviously, it does not perform as well on parallel applications, such as Splash-2, because there are fewer execution resources available and, therefore, insufficient room to support multiple concurrent threads.

At the high end, a large 8×8 grid of clusters is possible for future WaveCaches. Such a design does not fit in 90 nm technology, but becomes practical at the 45 nm technology node, where it is estimated to be roughly 250 mm². This design does little for single-threaded C applications, but increases Splash-2 performance by 2.8 times and fine-grained dataflow kernel performance by a factor of two, relative to the 4×4 baseline configuration.

Flowcharts Illustrating Exemplary Logical Steps

Several flowcharts illustrate exemplary logical steps that are implemented when carrying out the techniques described above. FIG. 26 illustrates a flowchart 330 showing the steps employed in processing a memory operation, when both ordered and unordered memory operations can be employed. This process starts with receipt of a memory operation in a block 332. A decision step 334 then determines if the memory operation is wave-ordered, and if not, a step 336 performs or executes the memory operation without regard to any ordering rules. Conversely, if the memory operation is ordered, a decision step 338 determines if the memory operation is a memory fence. If not, a step 340 performs the memory operation in accordance with the wave-ordering rules, as explained above. However, if the memory operation is a memory fence, a step 342 waits for all prior ordered memory operations to complete, according to the wave-ordering rules. Then, a step 344 sends an acknowledgement token to the targets of the memory fence instruction to indicate the completion of the ordered memory operations. Processing then continues.
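
The decision flow of FIG. 26 can be expressed compactly in code. The sketch below is only an illustration of that flow; the message fields and the stand-in memory interface are hypothetical names invented for this example.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class MemOpMsg:
        wave_ordered: bool
        is_fence: bool = False
        fence_targets: List[str] = field(default_factory=list)

    def handle_memory_op(op, memory):
        if not op.wave_ordered:
            memory.execute_unordered(op)              # step 336: no ordering constraints
        elif not op.is_fence:
            memory.execute_wave_ordered(op)           # step 340: obey wave-ordering rules
        else:
            memory.wait_for_prior_ordered_ops()       # step 342: drain prior ordered ops
            for target in op.fence_targets:
                memory.send_ack_token(target)         # step 344: signal completion

    class FakeMemory:                                 # stand-in for the store buffer / cache path
        def execute_unordered(self, op): print("unordered op executed")
        def execute_wave_ordered(self, op): print("ordered op executed per wave rules")
        def wait_for_prior_ordered_ops(self): print("waiting for prior ordered ops")
        def send_ack_token(self, target): print("ack token ->", target)

    handle_memory_op(MemOpMsg(wave_ordered=True, is_fence=True,
                              fence_targets=["consumer_0"]), FakeMemory())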

FIG. 27 illustrates a flowchart 400 showing exemplary logical steps for combining ordered and unordered memory operations using a sequence start instruction. The process begins at a step 420, in which one or more ordered memory operations are started. Steps 422 and 424 are then carried out in parallel. In step 422, a memory sequence stop operation is initiated, while in step 424, the ordered memory operations are allowed to complete. After both steps 422 and 424, a step 426 indicates that the memory sequence stop operation completes. A step 428 then executes the next unordered memory operations, followed by a step 430, which executes a memory sequence start operation. One or more ordered memory operations can then be started in a step 432.

A flowchart 450 in FIG. 28 is very similar to flowchart 400, and some of the same reference numbers have been applied to the same steps as they occur therein. After starting one or more ordered memory operations in step 420 of FIG. 28, steps 452 and 424 are carried out in parallel. In step 452, a memory fence operation is initiated, in parallel with the completion of the ordered memory operations in step 424. Then, in a step 454, the memory fence operation completes, and step 428 provides for executing unordered memory operations. A step 456 again executes a memory fence operation, before step 432 starts one or more ordered memory operations.

A flowchart 500 in FIG. 29 illustrates exemplary logical steps for implementing partial store control. In this flowchart and in the following discussion, the term “partial store queue” is abbreviated as PSQ. The logic begins with receipt of a memory operation message in a block 502. A decision step 504 determines if the memory operation is a store-address operation. If not, a decision step 506 determines if a PSQ is already allocated for this address (i.e., the current memory address). If not, a step 508 sends the operation to the data cache, which terminates the logic in this flowchart. However, an affirmative response leads to a decision step 510, which determines if the operation contains the datum completing the store-address operation at the head of the PSQ for this address. If so, a step 512 applies all memory operations in the PSQ, from the head to the oldest incomplete store operation. A step 514 then frees the PSQ for subsequent use.

If, in decision step 504, the operation is a store-address operation, a decision step 516 also determines if a PSQ is already allocated for this address. Although decision steps 506 and 516 are identical, they lead to different logical steps, based on the result of the determination made in decision step 504. An affirmative response to decision step 516, or a negative response to decision step 510, leads to a decision step 518, which determines if the PSQ is full. If so, the logic proceeds to a step 520, which temporarily rejects the current memory operation (until the PSQ is no longer full). A negative response to decision step 516 leads to a decision step 522, which determines if there is a free PSQ. If not, the logic also proceeds to step 520, to temporarily reject the memory operation being processed. Otherwise, the logic proceeds to a step 524, which allocates a PSQ to processing memory operations with the same address as this one (the current memory operation). A step 526 then places the memory operation in the PSQ, as discussed above.
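
The decision flow of FIG. 29 can be summarized in a short routine. The Python sketch below is illustrative only: the fixed pool of two four-entry PSQs follows the exemplary design above, but the field names are assumptions, a returned False stands in for the temporary rejection of step 520, and step 512 is simplified to draining the whole PSQ when the completing datum arrives.

    from dataclasses import dataclass

    PSQ_CAPACITY = 4           # each exemplary PSQ holds four requests
    NUM_PSQS = 2               # two PSQs per store buffer in the exemplary design
    psqs = {}                  # address -> list of queued operations

    @dataclass
    class Op:
        address: int
        is_store_address: bool = False
        completes_head_store: bool = False

    def handle_request(op, data_cache):
        psq = psqs.get(op.address)
        if not op.is_store_address:
            if psq is None:
                data_cache.append(op)              # step 508: straight to the cache
                return True
            if op.completes_head_store:
                data_cache.extend(psq)             # step 512: apply the queued operations
                data_cache.append(op)
                del psqs[op.address]               # step 514: free the PSQ
                return True
        else:
            if psq is None:
                if len(psqs) >= NUM_PSQS:
                    return False                   # no free PSQ: reject, retry later
                psqs[op.address] = psq = []        # step 524: allocate a PSQ
        if len(psq) >= PSQ_CAPACITY:
            return False                           # step 518/520: PSQ full, reject for now
        psq.append(op)                             # step 526: place the operation in the PSQ
        return True

    cache = []
    handle_request(Op(address=0x40, is_store_address=True), cache)      # allocates a PSQ
    handle_request(Op(address=0x40, completes_head_store=True), cache)  # drains and frees it
    print(len(cache), len(psqs))                                        # 2 0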

Although the present invention has been described in connection with the preferred form of practicing it and modifications thereto, those of ordinary skill in the art will understand that many other modifications can be made to the present invention within the scope of the claims that follow. Accordingly, it is not intended that the scope of the invention in any way be limited by the above description, but instead be determined entirely by reference to the claims that follow.

1. A method for synchronizing a plurality of threads in a dataflow processing architecture, comprising the steps of: (a) providing tags to be used in identifying each individual dynamic instance of data used when executing the thread; (b) annotating data values used in the plurality of threads, to include a specific thread identification indicating a specific thread with which each data value is associated, the thread identification being used in tokens for the instructions employed in the plurality of threads; and (c) providing a thread coordinate instruction that executes when a data value of a first token supplied to a first input of the thread coordinate instruction matches a thread identification of a second token supplied to a second input of the thread coordinate instruction, producing an output token having a tag of the first input, and a data value from the second input.
 2. The method of claim 1, wherein the first token is produced by a first thread that provides the first token, and the second token is produced by a second thread that is different than the first thread, the thread coordinate instruction forcing the first thread to await receipt of the second token from the second thread before continuing execution of the first thread.
 3. The method of claim 1, wherein the thread coordinate instruction is used to implement a plurality of different synchronization objects used for synchronizing interactions between the plurality of threads.
 4. The method of claim 3, wherein the synchronization objects enable at least two different threads to share a common resource at different times.
 5. The method of claim 1, further comprising the step of employing the thread coordinate instruction to implement fine-grained parallelism processing by the plurality of threads, by enabling data to be passed between the plurality of threads for processing using unordered instructions.
 6. A method for managing memory ordering hardware to allow storing of memory addresses and memory data comprising memory operations, so that a memory address and memory data can be supplied to the memory ordering hardware at different times, comprising the steps of: (a) providing a partial store structure for temporarily storing memory addresses and memory data for memory operations, where the memory addresses and the memory data for a memory operation arrive at the partial store structure at different times; (b) if a specific memory operation inserted into the memory ordering hardware is a memory load or a memory store, and if the memory address for the memory operation is already stored in the partial store structure, transferring the specific memory operation to the partial store structure; (c) if a specific memory operation is a memory store, but a memory datum for the specific memory operation has not yet arrived at the memory ordering hardware, and if a partial store structure does not yet exist for the memory address of the specific memory operation, providing a new partial store structure for temporarily storing the memory store until its datum arrives at the memory ordering hardware; and (d) once both the memory datum and the memory address for the specific memory operation have been inserted into the memory ordering hardware and temporarily stored in a partial store structure, transferring the memory datum and the memory address for all memory operations in the partial store structure to another portion of a memory system.
 7. The method of claim 6, wherein if a specific memory operation is a memory load operation or a memory store operation, but there is not an available partial store structure in the memory ordering hardware, or if the partial store structure that would otherwise be used is full, recovering from an overflow condition of the partial store structure.
 8. The method of claim 7, wherein the step of recovering from the overflow condition comprises the steps of: (a) discarding the specific memory operation; and (b) one of: (i) rolling back a memory operation issued register to account for discarding the specific memory operation; and (ii) allowing memory operations from more than one address to occupy a partial store structure.
 9. The method of claim 6, wherein if the specific memory operation is a memory load operation or a memory store operation that includes both the memory address and the memory datum, further comprising the step of transferring the memory datum and the memory address for the specific memory operation directly to another portion of the memory system, without using the partial store structure.
 10. The method of claim 6, further comprising the step of initially ordering memory operations included in a process, based upon wave number tags that are assigned to identify each individual dynamic instance of data, wherein the wave number tags are assigned by dividing a control flow graph of the process into a plurality of waves.